Abstract

A dataset has been classified by some unknown classifier into two types of 
points. What were the most important factors in determining the classification 
outcome? In this work, we employ an axiomatic approach in order to uniquely 
characterize an influence measure: a function that, given a set of classified points, 
outputs a value for each feature corresponding to its influence in determining the 
classification outcome. We show that our influence measure takes on an intuitive 
form when the unknown classifier is linear. Finally, we employ our influence measure
in order to analyze the effects of user profiling on Google’s online display
advertising. 


1 Introduction 

A recent White House report [Podesta et al., 2014] highlights some of the major risks in
the ubiquitous use of big data technologies. According to the report, one of the major 
issues with large scale data collection and analysis is a glaring lack of transparency. For 
example, a credit reporting company collects consumer data from third parties, and uses 
machine learning analysis to estimate individuals’ credit score. On the one hand, this 
method is “impartial”: an emotionless algorithm cannot be accused of being malicious 
(discriminatory behavior is not hard-coded). However, it is hardly transparent; indeed, 
it is difficult to tease out the determinants of one’s credit score: it depends on the 
user’s financial activities, age, address, the behavior of similar users and many other 
factors. This is a major issue: big-data analysis does not intend to discriminate, but 
inadvertent discrimination does occur: treating users differently based on unfair criteria 
(e.g. online retailers offering different discounts or goods based on place of residence 
or past purchases). 

In summary, big data analysis leaves users vulnerable. They may be discriminated
against, and no one (including the algorithm’s developers!) may even know why;
what’s worse, traditional methods for preserving user anonymity (e.g. by “opting out”
of data collection) offer little protection; big data techniques allow companies to infer
individuals’ data based on similar users [Barocas and Nissenbaum, 2014]. Since it is
often difficult to “pop the hood” and understand the inner workings of classification
algorithms, maintaining transparency in classification is a major challenge. In more 
concrete terms, transparency can be interpreted as understanding what influences the 
decisions of a black-box classifier. This is where our work comes in. 

Suppose that we are given a dataset $B$ of users; here, every user $a \in B$ can be
thought of as a vector of features (e.g. $a$ = (age, gender, IP address, ...)), where the
$i$-th coordinate of $a$ corresponds to the state of the $i$-th feature. Each $a$ has a value
$v(a)$ (say, the credit score of $a$). We are interested in the following question: given a
dataset $B$ of various feature vectors and their values, how influential was each feature
in determining these values?

In more detail, given a set $N = \{1, \dots, n\}$ of features and a dataset $B$ of feature
profiles, where every profile $a$ has a value $v(a)$, we would like to compute a measure
$\phi_i(N, B, v)$ that corresponds to feature $i$’s importance in determining the labels of the
points in $B$. We see this work as an important first step towards a concrete methodology
for transparency analysis of big-data algorithms.

Our Contribution: We take an axiomatic approach, which draws heavily on cooperative
game theory, to define an influence measure. The merit of our approach
lies in its independence of the underlying structure of the classification function; all we
need is to collect data on its behavior.

We show that our influence measure is the unique measure satisfying some natural 
properties (Section 2). As a case study, we show that when the input values are given 
by a linear classifier, our influence measure has an intuitive geometric interpretation 
(Section 3). Finally, we show that our axioms can be extended in order to obtain other 
influence measures (Section 4). For example, our axioms can be used to obtain a 
measure of state influence, as well as influence measures where a prior distribution on 
the data is assumed, or a measure that uses pseudo-distance between user profiles to 
measure influence. 

We complement our theoretical results with an implementation of our approach, 
which serves as a proof of concept (Section 5). Using our framework, we identify ads 
where certain user features have a significant influence on whether the ad is shown to 
users. Our experiments show that our influence measures behave in a desirable manner. 
In particular, a Spanish language ad — clearly biased towards Spanish speakers — 
demonstrated the highest influence of any feature among all ads. 

1.1 Related Work 

Axiomatic characterizations have played an important role in the design of provably 
fair revenue divisions [Shapley, 1953; Young, 1985; Banzhaf, 1965; Lehrer, 1988].
Indeed, one can think of the setting we describe as a generalization of cooperative 
games, where agents can have more than one state — in cooperative games, agents are 
either present or absent from a coalition. Some papers extend cooperative games to 
settings where agents have more than one state, and define influence measures for such 
settings [Chalkiadakis et al, 2010; Zick et al, 2014]; however, our setting is far more 
general.

Our definition of influence measures the ability of a feature to affect the classification
outcome if changed (e.g. how often does a change in gender cause a change in
the display frequency of an ad); this idea is used in the analysis of cause [Halpern and 
Pearl, 2005; Tian and Pearl, 2000], and responsibility [Chockler and Halpern, 2004]; 
our influence measure can be seen as an application of these ideas to a classification 
setting. 

Influence measures are somewhat related to feature selection [Blum and Langley, 
1997]. Feature selection is the problem of finding the set of features that are most relevant
to the classification task, in order to improve the performance of a classifier on
the data; that is, it is the problem of finding a subset of features, such that if we train 
a classifier using just those features, the error rate is minimized. Some of the work on 
feature selection employs feature ranking methods; some even use the Shapley value 
as a method for selecting the most important features [Cohen et al, 2005]. Our work 
differs from feature selection both in its objectives and its methodology. Our measures 
can be used in order to rank features, but we are not interested in training classifiers; 
rather, we wish to decide which features influence the decision of an unknown classifier.
That said, one can certainly employ our methodology in order to rank features in
feature selection tasks. 

When the classifier is linear, our influence measures take on a particularly intuitive
interpretation as the aggregate volume between two hyperplanes [Marichal and
Mossinghoff, 2006]. 

Recent years have seen tremendous progress on methods to enhance fairness in 
classification [Dwork et al, 2012; Kamishima et al., 2011], user privacy [Balebako et 
al, 2012; Pedreschi et al., 2008; Wills and Tatar, 2012] and the prevention of discrimination
[Kamiran and Calders, 2009; Calders and Verwer, 2010; Luong et al., 2011].
Our work can potentially inform all of these research thrusts: a classifier can be deemed 
fair if the influence of certain features is low; for example, high gender influence may 
indicate discrimination against a certain gender. In terms of privacy, if a hidden feature 
(i.e. one that is not part of the input to the classifier) has high influence, this indicates 
a possible breach of user privacy. 

2 Axiomatic Characterization 

We begin by briefly presenting our model. Given a set of features $N = \{1, \dots, n\}$,
let $A_i$ be the set of possible values, or states, that feature $i$ can take; for example, the
$i$-th feature could be gender, in which case $A_i$ = {male, female, other}. We are given
partial outputs of a function over a dataset containing feature profiles. That is, we are
given a subset $B$ of $A = \prod_{i \in N} A_i$, and a valuation $v(a)$ for every $a \in B$. By given, we
mean that we do not know the actual structure of $v$, but we know what values it takes
over the dataset $B$. Formally, our input is a tuple $\mathcal{G} = \langle N, B, v \rangle$, where $v : A \to \mathbb{Q}$
is a function assigning a value of $v(a)$ to each data point $a \in B$. We refer to $\mathcal{G}$ as the
dataset. When $v(a) \in \{0, 1\}$ for all $a \in B$, $v$ is a binary classifier. When $B = A$
and $|A_i| = 2$ for all $i \in N$, the dataset corresponds to a standard TU cooperative
game [Chalkiadakis et al, 2011] (and is a simple game if $v(a) \in \{0, 1\}$).

We are interested in answering the following question: how influential is feature
$i$? Our desired output is a measure $\phi_i(\mathcal{G})$ that will be associated with each feature $i$.
The measure $\phi_i(\mathcal{G})$ should be a good metric of the importance of $i$ in determining the
values of $v$ over $B$.

Our goal in this section is to show that there exists a unique influence measure 
that satisfies certain natural axioms. We begin by describing the axioms, starting with 
symmetry. 

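Given a dataset $\mathcal{G} = \langle N, B, v \rangle$ and a bijective mapping $\sigma$ from $N$ to itself, we
define $\sigma\mathcal{G} = \langle \sigma N, \sigma B, \sigma v \rangle$ in the natural way: $\sigma N$ has all of the features relabeled
according to $\sigma$ (i.e. the index of $i$ is now $\sigma(i)$); $\sigma B$ is $\{\sigma a \mid a \in B\}$, and $\sigma v(\sigma a) = v(a)$
for all $\sigma a \in \sigma B$. Given a bijective mapping $\tau : A_i \to A_i$ over the states of some
feature $i \in N$, we define $\tau\mathcal{G} = \langle N, \tau B, \tau v \rangle$ in a similar manner.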

Definition 2.1. An influence measure $\phi$ satisfies the feature symmetry property if it
is invariant under relabelings of features: given a dataset $\mathcal{G} = \langle N, B, v \rangle$ and some
bijection $\sigma : N \to N$, $\phi_i(\mathcal{G}) = \phi_{\sigma(i)}(\sigma\mathcal{G})$ for all $i \in N$. An influence measure $\phi$
satisfies the state symmetry property if it is invariant under relabelings of states: given a
dataset $\mathcal{G} = \langle N, B, v \rangle$, some $i \in N$, and some bijection $\tau : A_i \to A_i$, $\phi_j(\mathcal{G}) = \phi_j(\tau\mathcal{G})$
for all $j \in N$. Note that it is possible that $i = j$. A measure satisfying both state and
feature symmetry is said to satisfy the symmetry axiom (Sym).

Feature symmetry is a natural extension of the symmetry axiom defined for cooperative
games (see e.g. [Banzhaf, 1965; Lehrer, 1988; Shapley, 1953]). However, state
symmetry does not make much sense in classic cooperative games; it would translate
to saying that for any set of players $S \subseteq N$ and any $j \in N$, the value of $i$ is the same
if we treat $S$ as $S \setminus \{j\}$, and $S \setminus \{j\}$ as $S$. While in the context of cooperative games
this is rather uninformative, we make non-trivial use of it in what follows.

We next describe a sufficient condition for a feature to have no influence: a feature
should not have any influence if it does not affect the outcome in any way. Formally, a
feature $i \in N$ is a dummy if $v(a) = v(a_{-i}, b)$ for all $a \in B$, and all $b \in A_i$ such that
$(a_{-i}, b) \in B$.

Definition 2.2. An influence measure $\phi$ satisfies the dummy property (D) if $\phi_i(\mathcal{G}) = 0$
whenever $i$ is a dummy in the dataset $\mathcal{G}$.

The dummy property is a standard extension of the dummy property used in value
characterizations in cooperative games. However, when dealing with real datasets, it
may very well be that there is no vector $a \in B$ such that $(a_{-i}, b) \in B$; this issue is
discussed further in Section 6.
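
The dummy condition can be checked mechanically on a finite dataset. The following is a minimal Python sketch, assuming (as an illustration only, not something the paper prescribes) that the dataset is stored as a dictionary from feature tuples to values; the name `is_dummy` is hypothetical. The second example illustrates the caveat above: when no two observed profiles differ in exactly one feature, the condition holds vacuously.

```python
def is_dummy(i, dataset):
    """Return True if feature i is a dummy in `dataset`.

    `dataset` maps each observed profile (a tuple of feature states)
    to its value v(a). Feature i is a dummy if every pair of observed
    profiles that agree outside coordinate i has the same value.
    """
    for a, value in dataset.items():
        for c, other_value in dataset.items():
            agree_elsewhere = all(
                x == y for k, (x, y) in enumerate(zip(a, c)) if k != i
            )
            if agree_elsewhere and other_value != value:
                return False
    return True


# Full dataset over {0,1} x {0,1,2} whose value depends only on feature 0.
full = {(x, y): x for x in (0, 1) for y in (0, 1, 2)}
# Sparse dataset: no two observed profiles differ in exactly one feature.
sparse = {(0, 0): 0, (1, 1): 1}

print(is_dummy(0, full), is_dummy(1, full))      # feature 1 is a dummy, feature 0 is not
print(is_dummy(0, sparse), is_dummy(1, sparse))  # both conditions hold vacuously
```

The quadratic scan over pairs is chosen for readability; it is not meant as an efficient implementation.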

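Cooperative game theory employs a notion of value additivity in the characterization
of both the Shapley and Banzhaf values. Given two datasets $\mathcal{G}_1 = \langle N, B, v_1 \rangle$, $\mathcal{G}_2 = \langle N, B, v_2 \rangle$,
we define $\mathcal{G} = \langle N, B, v \rangle = \mathcal{G}_1 + \mathcal{G}_2$ with $v(a) = v_1(a) + v_2(a)$ for all
$a \in B$.

Definition 2.3. An influence measure $\phi$ satisfies additivity (AD) if $\phi_i(\mathcal{G}_1 + \mathcal{G}_2) =
\phi_i(\mathcal{G}_1) + \phi_i(\mathcal{G}_2)$ for any two datasets $\mathcal{G}_1 = \langle N, B, v_1 \rangle$, $\mathcal{G}_2 = \langle N, B, v_2 \rangle$.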

The additivity axiom is commonly used in the axiomatic analysis of revenue division
in cooperative games (see [Lehrer, 1988; Shapley, 1953]); however, it fails to
capture a satisfactory notion of influence in our more general setting. We now show
that any measure that satisfies additivity, in addition to the symmetry and dummy properties,
must evaluate to zero for all features. To show this, we first define the following
simple class of datasets.

Definition 2.4. Let $\mathcal{U}_a = \langle N, A, u_a \rangle$ be the dataset defined by the classifier $u_a$, where
$u_a(a') = 1$ if $a' = a$, and is 0 otherwise. The dataset $\mathcal{U}_a$ is referred to as the singleton
dataset over $a$.

It is an easy exercise to show that additivity implies that for any scalar $\alpha \in \mathbb{Q}$,
$\phi_i(\alpha\mathcal{G}) = \alpha\phi_i(\mathcal{G})$, where the dataset $\alpha\mathcal{G}$ has the value of every point scaled by a factor
of $\alpha$.

Proposition 2.5. Any influence measure that satisfies the (Sym), (D) and (AD) axioms
evaluates to zero for all features.

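Proof. First, we show that for any $a, a' \in A$ and any $b \in A_i$, it must be the case that
$\phi_i(\mathcal{U}_{(a_{-i}, b)}) = \phi_i(\mathcal{U}_{(a'_{-i}, b)})$. This is true because we can define a bijective mapping
from $\mathcal{U}_{(a_{-i}, b)}$ to $\mathcal{U}_{(a'_{-i}, b)}$: for every $j \in N \setminus \{i\}$, we swap $a_j$ and $a'_j$. By state symmetry,
$\phi_i(\mathcal{U}_{(a_{-i}, b)}) = \phi_i(\mathcal{U}_{(a'_{-i}, b)})$.

Next, if $\phi$ is additive, then for any dataset $\mathcal{G} = \langle N, B, v \rangle$,
$\phi_i(\mathcal{G}) = \sum_{a \in B} v(a)\,\phi_i(\mathcal{U}_a)$.
That is, the influence of a feature must be the sum of its influence over singleton
datasets, scaled by $v(a)$.

Now, suppose for contradiction that there exists some singleton dataset $\mathcal{U}_{\bar{a}}$ ($\bar{a} \in B$)
for which some feature $i \in N$ does not have an influence of zero. That is, we assume
that $\phi_i(\mathcal{U}_{\bar{a}}) \neq 0$. We define a dataset $\mathcal{G} = \langle N, A, v \rangle$ in the following manner: for all
$a \in A$ such that $a_{-i} = \bar{a}_{-i}$, we set $v(a) = 1$, and $v(a) = 0$ if $a_{-i} \neq \bar{a}_{-i}$. In the
resulting dataset, $v(a)$ is solely determined by the values of features in $N \setminus \{i\}$; in other
words $v(a) = v(a_{-i}, b)$ for all $b \in A_i$, hence feature $i$ is a dummy. According to the
dummy axiom, we must have that $\phi_i(\mathcal{G}) = 0$; however,

$$\phi_i(\mathcal{G}) = \sum_{a : v(a) = 1} \phi_i(\mathcal{U}_a) = \sum_{b \in A_i} \phi_i(\mathcal{U}_{(\bar{a}_{-i}, b)}) = \sum_{b \in A_i} \phi_i(\mathcal{U}_{\bar{a}}) = |A_i|\,\phi_i(\mathcal{U}_{\bar{a}}) \neq 0,$$

where the first equality follows from the decomposition of $\mathcal{G}$ into singleton datasets,
and the third equality holds by Symmetry. This is a contradiction. □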

As Proposition 2.5 shows, the additivity, symmetry and dummy properties do not
lead to a meaningful description of influence. A reader familiar with the axiomatic
characterization of the Shapley value [Shapley, 1953] will find this result rather
disappointing: the classic characterizations of the Shapley and Banzhaf values assume
additivity (that said, the axiomatization by Young [1985] does not assume additivity).

We now show that there is an influence measure uniquely defined by an alternative
axiom, which echoes the union intersection property described by Lehrer [1988]. In
what follows, we assume that all datasets are classified by a binary classifier. We write
$W(B)$ to be the set of all profiles in $B$ such that $v(a) = 1$, and $L(B)$ to be the set of all
profiles in $B$ that have a value of 0. We refer to $W(B)$ as the winning profiles in $B$, and
to $L(B)$ as the losing profiles in $B$. We can thus write $\phi_i(W(B), L(B))$, rather than
$\phi_i(\mathcal{G})$. Given two disjoint sets $W, L \subseteq A$, we can define the dataset as $\mathcal{G} = \langle W, L \rangle$, and
the influence of $i$ as $\phi_i(W, L)$, without explicitly writing $N$, $B$ and $v$. As we have seen,
no measure can satisfy the additivity axiom (as well as symmetry and dummy axioms)
without being trivial. We now propose an alternative influence measure, captured by
the following axiom:

Definition 2.6. An influence measure $\phi$ satisfies the disjoint union (DU) property if for
any $Q \subseteq A$, and any disjoint $R, R' \subseteq A \setminus Q$, $\phi_i(Q, R) + \phi_i(Q, R') = \phi_i(Q, R \cup R')$,
and $\phi_i(R, Q) + \phi_i(R', Q) = \phi_i(R \cup R', Q)$.

An influence measure satisfying the (DU) axiom is additive with respect to independent
observations of the same type. Suppose that we are given the outputs of a
binary classifier on two datasets: $\mathcal{G}_1 = \langle W, L_1 \rangle$ and $\mathcal{G}_2 = \langle W, L_2 \rangle$. The (DU) axiom
states that the ability of a feature to affect the outcome on $\mathcal{G}_1$ is independent of its
ability to affect the outcome in $\mathcal{G}_2$, if the winning states are the same in both datasets.

Replacing additivity with the disjoint union property yields a unique influence measure,
with a rather simple form:

$$\chi_i(\mathcal{G}) = \sum_{a \in B} \; \sum_{b \in A_i : (a_{-i}, b) \in B} |v(a_{-i}, b) - v(a)| \tag{1}$$

$\chi$ measures the number of times that a change in the state of $i$ causes a change in the
classification outcome. If we normalize $\chi$ by dividing by $|B|$, the resulting measure has
the following intuitive interpretation: pick a vector $a \in B$ uniformly at random, and
count the number of states $b \in A_i$ for which $(a_{-i}, b) \in B$ and setting $i$ to $b$ changes the value of $a$.
We note that when all features have two states and $B = A$, $\chi$ coincides with the (raw)
Banzhaf power index [Banzhaf, 1965].
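
To make the counting in Equation (1) concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the dataset is a dictionary from feature tuples to 0/1 values; the function name `chi` and the majority-vote example are ours, not part of the original text.

```python
from itertools import product


def chi(i, dataset):
    """Raw influence of feature i, following Equation (1).

    Sums |v(a_-i, b) - v(a)| over every observed profile a and every
    state b of feature i such that (a_-i, b) is also observed.
    """
    states = {a[i] for a in dataset}  # observed states of feature i
    total = 0
    for a, value in dataset.items():
        for b in states:
            neighbour = a[:i] + (b,) + a[i + 1:]
            if neighbour in dataset:
                total += abs(dataset[neighbour] - value)
    return total


# Majority vote over three binary features, with B = A (all 8 profiles).
majority = {a: int(sum(a) >= 2) for a in product((0, 1), repeat=3)}
raw = [chi(i, majority) for i in range(3)]
normalized = [r / len(majority) for r in raw]
print(raw, normalized)
```

In this example each feature changes the outcome exactly when the other two disagree, so all three features receive the same influence. Note that the double sum visits each flipping pair of profiles once from either side.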

We now show that $\chi$ is the unique measure satisfying (D), (Sym) and (DU). We begin
by presenting the following lemma, which characterizes influence measures satisfying
(D), (Sym) and (DU) when the dataset contains only a single feature.

Lemma 2.7. Let $\phi$ be an influence measure that satisfies state symmetry, and let $\mathcal{G}_1 =
\langle \{i\}, A_i, v_1 \rangle$ and $\mathcal{G}_2 = \langle \{i\}, A_i, v_2 \rangle$ be two datasets with a single feature $i$; if the
number of winning states under $\mathcal{G}_1$ and $\mathcal{G}_2$ is identical, then $\phi_i(\mathcal{G}_1) = \phi_i(\mathcal{G}_2)$.

Proof Sketch. We simply construct a bijective mapping from the winning states of $i$
under $\mathcal{G}_1$ to its winning states in $\mathcal{G}_2$. By state symmetry, $\phi_i(\mathcal{G}_1) = \phi_i(\mathcal{G}_2)$. □

Lemma 2.7 implies that for single feature games, the value of a feature only depends
on the number of winning states, rather than their identity.

We are now ready to show the main theorem for this section: $\chi$ is the unique influence
measure satisfying the three axioms above, up to a constant factor.

Theorem 2.8. An influence measure $\phi$ satisfies (D), (Sym) and (DU) if and only if there
exists a constant $C$ such that for every dataset $\mathcal{G} = \langle N, B, v \rangle$

$$\phi_i(\mathcal{G}) = C \cdot \chi_i(\mathcal{G}).$$

Proof. It is an easy exercise to verify that $\chi$ satisfies the three axioms, so we focus on
the “only if” direction.

We present our proof assuming that we are given the set $A$ as data; the proof goes
through even if we assume that we are presented with some arbitrary $B \subseteq A$. Let us
write $W = W(A)$ and $L = L(A)$. Given some $a_{-i} \in A_{-i}$, we write $L_{a_{-i}} = \{\bar{a} \in L \mid
\bar{a}_{-i} = a_{-i}\}$, and $W_{a_{-i}} = \{\bar{a} \in W \mid \bar{a}_{-i} = a_{-i}\}$.

Using the disjoint union property, we can decompose $\phi_i(W, L)$ as follows:

$$\phi_i(W, L) = \sum_{a_{-i} \in A_{-i}} \; \sum_{\bar{a}_{-i} \in A_{-i}} \phi_i(W_{a_{-i}}, L_{\bar{a}_{-i}}) \tag{2}$$

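Now, if $a_{-i} \neq \bar{a}_{-i}$, then feature $i$ is a dummy given the dataset provided. Indeed,
state profiles are either in $W_{a_{-i}}$ or in $L_{\bar{a}_{-i}}$; that is, if $v(a_{-i}, b) = 0$, then $(a_{-i}, b)$ is
unobserved, and if $v(\bar{a}_{-i}, b) = 1$, then $(\bar{a}_{-i}, b)$ is unobserved. We conclude that

$$\phi_i(W, L) = \sum_{a_{-i} \in A_{-i}} \phi_i(W_{a_{-i}}, L_{a_{-i}}) \tag{3}$$

Let us now consider $\phi_i(W_{a_{-i}}, L_{a_{-i}})$. Since $\phi$ satisfies state symmetry, Lemma 2.7
implies that $\phi_i$ can only possibly depend on $a_{-i}$, $|W_{a_{-i}}|$ and $|L_{a_{-i}}|$. Next, for any
$a_{-i}$ and $a'_{-i}$ such that $|L_{a_{-i}}| = |L_{a'_{-i}}|$ and $|W_{a_{-i}}| = |W_{a'_{-i}}|$, we have by Lemma 2.7 that
$\phi_i(W_{a_{-i}}, L_{a_{-i}}) = \phi_i(W_{a'_{-i}}, L_{a'_{-i}})$. In other words, $\phi_i$ only depends on $|W_{a_{-i}}|$ and $|L_{a_{-i}}|$,
and not on the identity of $a_{-i}$.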

Thus, one can see $\phi_i$ for a single feature as a function of two parameters, $w$ and
$l$ in $\mathbb{N}$, where $w$ is the number of winning states and $l$ is the number of losing states.
According to the dummy property, we know that $\phi_i(w, 0) = \phi_i(0, l) = 0$; moreover,
the disjoint union property tells us that $\phi_i(x, l) + \phi_i(y, l) = \phi_i(x + y, l)$, and that
$\phi_i(w, x) + \phi_i(w, y) = \phi_i(w, x + y)$. We now show that $\phi_i(w, l) = \phi_i(1, 1)\,wl$.

Our proof is by induction on $w + l$. For $w + l = 2$ the claim is clear. Now, assume
without loss of generality that $w > 1$ and $l \geq 1$; then we can write $w = x + y$ for
$x, y \in \mathbb{N}$ such that $1 \leq x, y < w$. By our previous observation,

$$\phi_i(w, l) = \phi_i(x, l) + \phi_i(y, l) \overset{\text{i.h.}}{=} \phi_i(1, 1)\,xl + \phi_i(1, 1)\,yl = \phi_i(1, 1)\,wl.$$

Now, $\phi_i(1, 1)$ is the influence of feature $i$ when there is exactly one losing state profile,
and one winning state profile. We write $\phi_i(1, 1) = c_i$.

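Let us write $W_i(a_{-i}) = \{b \in A_i \mid v(a_{-i}, b) = 1\}$ and $L_i(a_{-i}) = A_i \setminus W_i(a_{-i})$.
Thus, $|W_{a_{-i}}| = |W_i(a_{-i})|$, and $|L_{a_{-i}}| = |L_i(a_{-i})|$. Putting it all together, we get that

$$\phi_i(\mathcal{G}) = c_i \sum_{a_{-i} \in A_{-i}} |W_i(a_{-i})| \cdot |L_i(a_{-i})| \tag{4}$$

We just need to show that the measure given in (4) equals $\chi_i$ up to a constant factor. Indeed,
the sum in (4) equals $\sum_{a \in A : v(a) = 0} |W_i(a_{-i})|$, which in turn equals
$\sum_{a \in A : v(a) = 0} \sum_{b \in A_i} |v(a_{-i}, b) - v(a)|$. Similarly, the sum in (4) equals

$$\sum_{a \in A : v(a) = 1} \; \sum_{b \in A_i} |v(a_{-i}, b) - v(a)|.$$

Thus,

$$\sum_{a_{-i} \in A_{-i}} |W_i(a_{-i})| \cdot |L_i(a_{-i})| = \frac{1}{2} \sum_{a \in A} \; \sum_{b \in A_i} |v(a_{-i}, b) - v(a)|;$$

in particular, for every dataset $\mathcal{G} = \langle N, A, v \rangle$ and every $i \in N$, there is some constant
$C_i$ such that $\phi_i(\mathcal{G}) = C_i \chi_i(\mathcal{G})$. To conclude the proof, we must show that
$C_i = C_j$ for all $i, j \in N$. Let $\sigma : N \to N$ be the bijection that swaps $i$ and $j$;
then $\phi_i(\mathcal{G}) = \phi_{\sigma(i)}(\sigma\mathcal{G})$. By feature symmetry, $C_i \chi_i(\mathcal{G}) = \phi_i(\mathcal{G}) = \phi_{\sigma(i)}(\sigma\mathcal{G}) =
\phi_j(\sigma\mathcal{G}) = C_j \chi_j(\sigma\mathcal{G}) = C_j \chi_i(\mathcal{G})$, thus $C_i = C_j$. □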
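
The counting step at the end of the proof (the product of winning and losing state counts, summed over the other features, accounts for half of the ordered pairs counted by the measure in Equation (1)) can be checked numerically. The sketch below is only a sanity check under assumptions of our own: $B = A$, and a small hypothetical classifier `v`; the function names are illustrative.

```python
from itertools import product


def chi(i, v, domains):
    """Equation (1) with B = A: ordered pairs that differ only in feature i."""
    return sum(
        abs(v(a[:i] + (b,) + a[i + 1:]) - v(a))
        for a in product(*domains)
        for b in domains[i]
    )


def win_loss_products(i, v, domains):
    """Sum over a_-i of |W_i(a_-i)| * |L_i(a_-i)|, i.e. Equation (4) with c_i = 1."""
    others = domains[:i] + domains[i + 1:]
    total = 0
    for rest in product(*others):
        # Number of states b of feature i for which (rest, b) is a winning profile.
        wins = sum(v(rest[:i] + (b,) + rest[i:]) for b in domains[i])
        total += wins * (len(domains[i]) - wins)
    return total


# Hypothetical binary classifier over three features with 3, 2 and 3 states.
domains = [(0, 1, 2), (0, 1), (0, 1, 2)]
v = lambda a: int(a[0] + 2 * a[1] + a[2] >= 3)

for i in range(len(domains)):
    print(i, chi(i, v, domains), 2 * win_loss_products(i, v, domains))
```

For every feature the two printed numbers agree, which is the factor of one half relating the two expressions.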


3 Case Study: Influence for Linear Classifiers


To further ground our results, we now present their application to the class of linear 
classifiers. For this class of functions, our influence measure takes on an intuitive 
interpretation. 

A linear classifier is defined by a hyperplane in $\mathbb{R}^n$; all points that are on one side
of the hyperplane are colored blue (in our setting, have value 1), and all points on the
other side are colored red (have a value of 0). Formally, we associate a weight $w_i \in \mathbb{R}$
with every one of the features in $N$ (we assume that $w_i \neq 0$ for all $i \in N$); a point
$x \in \mathbb{R}^n$ is blue if $x \cdot w \geq q$, where $q \in \mathbb{R}$ is a given parameter. The classification
function $v : \mathbb{R}^n \to \{0, 1\}$ is given by

$$v(x) = \begin{cases} 1 & \text{if } x \cdot w \geq q \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$


Fixing the value of $x_i$ to some $b \in \mathbb{R}$, let us consider the set $W_i(b) = \{x_{-i} \in
\mathbb{R}^{n-1} \mid v(x_{-i}, b) = 1\}$; we observe that if $b < b'$ and $w_i > 0$, then $W_i(b) \subseteq W_i(b')$ (if
$w_i < 0$ then $W_i(b') \subseteq W_i(b)$). Given two values $b, b' \in \mathbb{R}$, we denote by

$$D_i(b, b') = \{x_{-i} \in \mathbb{R}^{n-1} \mid v(x_{-i}, b) \neq v(x_{-i}, b')\}.$$

By our previous observation (taking $w_i > 0$), if $b < b'$ then $D_i(b, b') = W_i(b') \setminus W_i(b)$, and if $b > b'$
then $D_i(b, b') = W_i(b) \setminus W_i(b')$.

Suppose that rather than taking points in $\mathbb{R}^n$, we only take points in $[0, 1]^n$; then
we can define $|D_i(b, b')| = \mathrm{Vol}(D_i(b, b'))$, where

$$\mathrm{Vol}(D_i(b, b')) = \int_{x_{-i} \in [0,1]^{n-1}} |v(x_{-i}, b') - v(x_{-i}, b)| \, dx_{-i}.$$

In other words, in order to measure the total influence of setting the state of feature
$i$ to $b$, we must take the total volume of $D_i(b, b')$ for all $b' \in [0, 1]$, which
equals $\int_{b'=0}^{1} \mathrm{Vol}(D_i(b, b')) \, db'$. Thus, the total influence of setting the state of $i$ to $b$
is $\int_{x \in [0,1]^n} |v(x_{-i}, b) - v(x)| \, dx$. The total influence of $i$ would then be naturally the
total influence of its states, i.e.

$$\int_{b=0}^{1} \int_{x \in [0,1]^n} |v(x_{-i}, b) - v(x)| \, dx \, db. \tag{6}$$



The formula in Equation (6) is denoted by $\chi_i(w; q)$. Equation (1) is a discretized
version of Equation (6); the results of Section 2 can be extended to the continuous
setting, with only minimal changes to the proofs.

We now show that the measure given in (6) agrees with the weights in some natural 
manner. This intuition is captured in Theorem 3.1 (proof omitted). 

Theorem 3.1. Let $v$ be a linear classifier defined by $w$ and $q$; then $\chi_i(\mathcal{G}) \geq \chi_j(\mathcal{G})$ if
and only if $|w_i| \geq |w_j|$.
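
As an informal illustration of Theorem 3.1 (not a proof), the integral in Equation (6) can be approximated on a grid. The sketch below uses a midpoint rule on $[0,1]^n$ and assumes the threshold convention $v(x) = 1$ when $x \cdot w \geq q$; the function name `linear_influence`, the grid resolution and the example weights are all our own choices.

```python
import itertools


def linear_influence(i, w, q, steps=20):
    """Midpoint-rule estimate of Equation (6) for v(x) = 1 if x.w >= q else 0."""
    grid = [(k + 0.5) / steps for k in range(steps)]
    n = len(w)

    def v(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= q else 0

    total = 0
    for x in itertools.product(grid, repeat=n):
        vx = v(x)
        for b in grid:
            # Replace coordinate i of x by b and compare the two labels.
            total += abs(v(x[:i] + (b,) + x[i + 1:]) - vx)
    # Each grid cell of the (n + 1)-dimensional domain has volume steps**-(n + 1).
    return total / steps ** (n + 1)


w, q = (3.0, 1.0), 2.0
estimates = [linear_influence(i, w, q) for i in range(len(w))]
print(estimates)  # the feature with the larger |w_i| receives the larger estimate
```

The cost grows as `steps ** (n + 1)`, so this is only practical for a handful of features.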

Given Theorem 3.1, one would expect the following to hold: suppose that we are
given two weight vectors, $w, w' \in \mathbb{R}^n$ such that $w_j = w'_j$ for all $j \neq i$, but $w_i < w'_i$.
Let $v$ be the linear classifier defined by $w$ and $q$ and $v'$ be the linear classifier defined
by $w'$ and $q$. Is it the case that feature $i$ is more influential under $v'$ than under $v$?
In other words, does influence monotonicity hold when we increase the weight of an
individual feature? The answer to this is negative.

Example 3.2. Let us consider a single feature game where $N = \{1\}$, $A_1 = [0, 1]$, and
$v(x) = 1$ if $wx \geq q$, and $v(x) = 0$ if $wx < q$ for a given $w \geq q$. The fraction of times
that 1 is pivotal is

$$|Piv_1| = \int_{b=0}^{1} \int_{x=0}^{1} I(v(b) = 1 \wedge v(x) = 0) \, dx \, db;$$

simplifying, this expression is equal to $\left(1 - \frac{q}{w}\right)\frac{q}{w}$. We can show that $\chi_1 = 2|Piv_1|$;
hence $\chi_1$ is maximized when $w = 2q$. In particular, $\chi_1$ is monotone increasing
when $q < w < 2q$, and it is monotone decreasing when $w > 2q$.
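
The non-monotonicity in Example 3.2 is easy to see numerically. The following sketch assumes the reading used in the example, namely that $v(x) = 1$ exactly when $x \geq q/w$, so that the pivotal fraction is $(1 - q/w)(q/w)$; the function name and the range of weights scanned are our own illustrative choices.

```python
def pivotal_fraction(w, q):
    """Probability that v(b) = 1 and v(x) = 0 for independent uniform b, x in [0, 1]."""
    t = q / w  # threshold on the single feature; assumes w >= q > 0
    return (1 - t) * t


q = 1.0
weights = [q * (1 + 0.1 * k) for k in range(31)]          # w ranges from q to 4q
influence = [2 * pivotal_fraction(w, q) for w in weights]  # chi_1 = 2 |Piv_1|
best = weights[influence.index(max(influence))]
print(best, max(influence))  # the maximum is attained at w = 2q
```

Increasing the weight beyond $2q$ moves the threshold $q/w$ below one half, so fewer points are labeled 0 and the feature has fewer outcomes left to flip.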

Example 3.2 highlights the following phenomenon: fixing the other features to be
$a_{-i}$, the influence of $i$ is maximized when $|L_{a_{-i}}| = |W_{a_{-i}}|$. This can be interpreted
probabilistically: we sample a random feature from $B$, and assume that for any fixed
$a_{-i} \in A_{-i}$, $\Pr[v(a_{-i}, b) = 1] = \frac{1}{2}$. The better a feature $i$ agrees with our assumption,
the more $i$ is rewarded. More generally, an influence measure satisfies the agreement
with prior assumption (APA) axiom if for any vector $(p_1, \dots, p_n) \in [0, 1]^n$, and any
fixed $a_{-i} \in A_{-i}$, $i$’s influence increases as $|\Pr[v(a_{-i}, b) = 1] - p_i|$ decreases. A variant
of the symmetry axiom (that reflects changes in probabilities when labels change),
along with the dummy and disjoint union axioms can give us a weighted influence
measure as described in Section 4.2, that also satisfies the (APA) axiom.

4 Extensions of the Feature Influence Measure 

Section 2 presents an axiomatic characterization of feature influence, where the value 
of each feature vector is either zero or 1. We now present a few possible extensions of 
the measure, and the variations on the axioms that they require. 

4.1 State Influence 

Section 2 provided an answer to questions of the following form: what is the impact
of gender on classification outcomes? The answer provided in previous sections was
that influence was a function of the feature’s ability to change outcomes by changing
its state.

It is also useful to ask a related question: what is the impact of the gender feature
being set to “female” on classification outcomes? In other words, rather than measuring
feature influence, we are measuring the influence of feature $i$ being in a certain state.
The results described in Section 2 can be easily extended to this setting. Moreover,
the impossibility result described in Proposition 2.5 no longer holds when we measure
state (rather than feature) influence: we can replace the disjoint union property
with additivity to obtain an alternative characterization of state influence.

4.2 Weighted Influence 

Suppose that in addition to the dataset $B$, we are given a weight function $w : B \to \mathbb{R}$.
$w(a)$ can be thought of as the number of occurrences of the vector $a$ in the dataset,
the probability that $a$ appears, or some intrinsic importance measure of $a$. Note that
in Section 2 we implicitly assume that all points occur at the same frequency (are
equally likely) and are equally important. A simple extension of the disjoint union
and symmetry axioms to a weighted variant shows that the only weighted influence
measure that satisfies these axioms is

$$\chi_i^w(B) = \sum_{a \in B} \; \sum_{b \in A_i : (a_{-i}, b) \in B} w(a) \, |v(a_{-i}, b) - v(a)|.$$
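
A weighted variant differs from the unweighted measure only in that each term is scaled by the weight of the profile $a$ being perturbed. The sketch below assumes, for illustration, that both the values and the weights are stored in dictionaries keyed by feature tuples; the name `weighted_chi` and the example weights are hypothetical.

```python
def weighted_chi(i, dataset, weight):
    """Weighted influence of feature i.

    Each term |v(a_-i, b) - v(a)| is multiplied by weight[a], the
    weight of the profile a whose state of feature i is being changed.
    """
    states = {a[i] for a in dataset}  # observed states of feature i
    total = 0.0
    for a, value in dataset.items():
        for b in states:
            neighbour = a[:i] + (b,) + a[i + 1:]
            if neighbour in dataset:
                total += weight[a] * abs(dataset[neighbour] - value)
    return total


# Two binary features; the value depends only on feature 0.
data = {(0, 0): 0, (1, 0): 1, (0, 1): 0, (1, 1): 1}
# Hypothetical occurrence counts for each profile.
counts = {(0, 0): 3, (1, 0): 1, (0, 1): 2, (1, 1): 2}

print(weighted_chi(0, data, counts), weighted_chi(1, data, counts))
```

Setting every weight to 1 recovers the unweighted count.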

4.3 General Distance Measures 

Suppose that instead of a classifier $v : A \to \{0, 1\}$ we are given a pseudo-distance
measure; that is, a function $d : A \times A \to \mathbb{R}$ that satisfies $d(a, a') = d(a', a)$, $d(a, a) = 0$
and the triangle inequality. Note that it is possible that $d(a, a') = 0$ but $a \neq a'$. An
axiomatic analysis in such general settings is possible, but requires more assumptions
on the behavior of the influence measure. Such an axiomatic approach leads us to show
that the influence measure

$$\chi_i^d(B) = \sum_{a \in B} \; \sum_{b \in A_i : (a_{-i}, b) \in B} d((a_{-i}, b), a)$$

is uniquely defined via some natural axioms. The additional axioms are a simple extension
of the disjoint union property, and a minimal requirement stating that when
$B = \{a, (a_{-i}, b)\}$, then the influence of a feature is $\alpha \, d((a_{-i}, b), a)$ for some constant
$\alpha$ independent of $i$. The extension to pseudo-distances proves to be particularly useful
when we conduct empirical analysis of Google’s display ads system, and the effects
user metrics have on display ads.
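
The distance-based measure can be sketched in the same way as the binary one, with the absolute difference of labels replaced by a distance between profiles. In the Python sketch below the distance between two profiles is taken to be one minus the cosine similarity of their value vectors; this choice, the toy value vectors, and all function names are assumptions made for illustration, and the value vectors are assumed to be nonzero.

```python
import math


def cosine_distance(x, y):
    """1 - (x . y) / (|x| |y|); assumes both vectors are nonzero."""
    dot = sum(p * q for p, q in zip(x, y))
    norms = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return 1.0 - dot / norms


def distance_influence(i, profiles, d):
    """Sum of d((a_-i, b), a) over observed a and observed neighbours (a_-i, b)."""
    states = {a[i] for a in profiles}
    total = 0.0
    for a in profiles:
        for b in states:
            neighbour = a[:i] + (b,) + a[i + 1:]
            if neighbour in profiles:
                total += d(neighbour, a)
    return total


# Toy value vectors (shares of two ads) for (language, gender) profiles.
values = {
    ("en", "m"): (1.0, 0.0), ("en", "f"): (1.0, 0.0),
    ("es", "m"): (0.0, 1.0), ("es", "f"): (0.0, 1.0),
}
d = lambda a, c: cosine_distance(values[a], values[c])

print(distance_influence(0, values, d), distance_influence(1, values, d))
```

Here language fully determines which ad is shown, so it receives all of the influence and gender receives none.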

5 Implementation 

We implement our influence measure to study Google’s display advertising system.
Users can set demographics (like gender or age) on the Google Ad Settings page
(google.com/settings/ads); these are used by the Google ad serving algorithm to
determine which ads to serve. We
apply our influence measure to study how demographic settings influence the targeted
ads served by Google. We use the AdFisher tool [Datta et al., 2014] to automate
browser activity and collect ads.

We pick the set of features: $N$ = {gender, age, language}. Feature states are
{male, female} for gender, {18-24, 35-44, 55-64} for age, and {English, Spanish}
for language; this gives us 2 x 3 x 2 = 12 possible user profiles. Using AdFisher,
we launch twelve fresh browser instances, and assign each one a random user profile.
For each browser instance, the corresponding settings are applied on the Ad Settings
page, and Google ads on the BBC news page bbc.com/news are collected. For each
browser, the news page is reloaded 10 times with 5 second intervals.

To eliminate ads differing due to random chance, we collect ads over 100 iterations,
each comprising 12 browser instances, thereby obtaining data for 1200 simulated
users. In order to minimize confounding factors such as location and system
specifications, all browser instances were run from the same stationary Ubuntu machine.
The 1200 browsers received a total of 32,451 ads (763 unique); in order to
reduce the amount of noise, we focus only on ads that were displayed more than 100
times, leaving a total of 55 unique ads. Each user profile $a$ thus has a frequency
vector of all ads $v'(a) \in \mathbb{N}^{55}$, where the $k$-th coordinate is the number of times ad
$k$ appeared for a user profile $a$. We normalize $v'(a)$ for each ad by the total number
of times that ad appeared. Thus we obtain the final value-vectors by computing
$v_k(a) = \frac{v'_k(a)}{\sum_{a'} v'_k(a')}$, $\forall k \in \{1, \dots, 55\}$.
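
The normalization step can be sketched as follows. This is an illustrative reconstruction rather than the code used in the experiments: it assumes the raw counts are held in a dictionary from profiles to equal-length lists (one entry per ad), and that every ad appears at least once, which holds here since only ads shown more than 100 times are kept. The name `normalize_by_ad` and the toy counts are hypothetical.

```python
def normalize_by_ad(counts):
    """Divide each ad's count for a profile by that ad's total count over all profiles."""
    n_ads = len(next(iter(counts.values())))
    totals = [sum(vec[k] for vec in counts.values()) for k in range(n_ads)]
    return {
        profile: [vec[k] / totals[k] for k in range(n_ads)]
        for profile, vec in counts.items()
    }


# Toy example with two profiles and two ads.
raw_counts = {("male",): [30, 0], ("female",): [10, 20]}
shares = normalize_by_ad(raw_counts)
print(shares)
```

After normalization each coordinate sums to 1 across profiles, so frequently shown ads do not dominate the value vectors.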

Since user profile values are vectors, we use the general distance influence measure
described in Section 4.3. The pseudo-distance we use is based on cosine similarity:
$\mathrm{cosd}(x, y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|}$; cosine similarity has been used by Tschantz et al.
[2014] and Guha et al. [2010] to measure similarity between display ads. The influence
measures for gender, age, and language were 0.124, 0.120, and 0.141 respectively;
in other words, no specific feature has a strong influence over ads displayed.

We next turn to measuring feature effects on specific ads. Fixing an ad $k$, we define
the value of a feature vector to be the number of times that ad $k$ was displayed for users
with that feature vector, and use $\chi$ to measure influence.

We compare the influence measures for each attribute across all the ads and identify 
the top ads that demonstrate high influence. The ad for which language had the highest 
influence (0.167) was a Spanish language ad, which was served only to browsers that 
set ‘Spanish’ as their language on the Ad Settings page. Comparing with statistics like 
mean and maximum over measures across all features given in Table 1, we can see that 
this influence was indeed high. 

To conclude, using a general distance measure between two value-vectors, we identify
that language has the highest influence on ads. By using a more fine-grained
distance function, we can single out one ad which demonstrates high influence for
language. While in this case the bias is acceptable, the experiment suggests that our
framework is effective in pinpointing biased or discriminatory ads.

Statistic   Gender    Age       Language
Max         0.07      0.0663    0.167
Min         0.00683   0.00551   0.00723
Mean        0.0324    0.0318    0.0330
Median      0.0299    0.0310    0.0291
StdDev      0.0161    0.0144    0.024


Table 1: Statistics over influence measures across features. 

6 Conclusions and Future Work 

In this work, we analyze influence measures for classification tasks. Our influence 
measure is uniquely defined by a set of natural axioms, and is easily extended to other 
settings. The main advantage of our approach is the minimal knowledge we have of 
the classification algorithm. We show the applicability of our measure by analyzing 
the effects of user features on Google’s display ads, despite having no knowledge of 
Google’s classification algorithm (which, we suspect, is quite complex). 

Dataset classification is a useful application of our methods; however, our work 
applies to extensions of TU cooperative games where agents have more than two states 
(e.g. OCF games [Chalkiadakis et al, 2010]). 

The measure $\chi$ is hard to compute exactly, since it generalizes the raw
Banzhaf power index, for which this task is known to be hard [Chalkiadakis et al.,
2011]. That said, both the Shapley and Banzhaf values can be approximated via random
sampling [Bachrach et al., 2010]. It is straightforward to show that random sampling
provides good approximations for $\chi$ as well, assuming a binary classifier.
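
The sampling idea can be illustrated with a minimal sketch. It assumes the unnormalized form of the measure (a count of pairs $(a, b)$ for which switching feature $i$ of $a$ to state $b$ flips the label) and a toy classifier `v`; the function names, the toy domains and the sample size are all hypothetical choices made for illustration, not part of the original experiments.

```python
import random
from itertools import product

def chi_exact(v, domains, i):
    """Count pairs (a, b) where switching feature i of a to state b flips v."""
    total = 0
    for a in product(*domains):
        for b in domains[i]:
            flipped = a[:i] + (b,) + a[i + 1:]
            total += abs(v(flipped) - v(a))
    return total

def chi_sampled(v, domains, i, samples, rng):
    """Estimate chi_exact by sampling (a, b) uniformly and rescaling the flip rate."""
    hits = 0
    for _ in range(samples):
        a = tuple(rng.choice(d) for d in domains)
        b = rng.choice(domains[i])
        flipped = a[:i] + (b,) + a[i + 1:]
        hits += abs(v(flipped) - v(a))
    num_pairs = len(domains[i])      # |A| * |A_i| pairs in total
    for d in domains:
        num_pairs *= len(d)
    return hits / samples * num_pairs

# Toy setting (assumed): three features, a binary classifier on the first two.
domains = [(0, 1), (0, 1, 2), (0, 1)]
v = lambda a: int(a[0] + a[1] >= 2)
exact = chi_exact(v, domains, 1)
estimate = chi_sampled(v, domains, 1, 20000, random.Random(0))
```

The exact routine enumerates all of $A$ and is only feasible for small feature spaces; the sampled routine needs only black-box access to the classifier.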

Our results can be extended in several ways. The measure $\chi$ is the number of times
a change in a feature’s state causes a change in the outcome. However, a partial dataset
of observations may not contain any pair of vectors $a, a' \in B$ such that $a' = (a_{-i}, b)$.
In Section 5, we control the dataset, so we ensure that all feature profiles appear. However,
other datasets would not be as well-behaved. Extending our influence measure
to accommodate non-immediate influence is an important step towards applying
our results to other classification domains. Indeed, the next step of our work is analyzing
large-scale datasets, in order to better understand the ideas behind our influence
measure.
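
The difficulty with partial datasets can be seen in a small sketch. It assumes the dataset is given as a dictionary from feature tuples to 0/1 labels and computes the measure restricted to pairs that both appear in the data, by grouping points that agree on every feature except $i$; `chi_dataset` and the three-point dataset are hypothetical.

```python
from collections import defaultdict

def chi_dataset(labelled, i):
    """Influence of feature i restricted to pairs of observed points.

    labelled maps feature tuples to 0/1 labels. Points are grouped by their
    values on all features other than i; a group with W winning and L losing
    points contributes 2 * W * L ordered pairs whose labels differ.
    """
    groups = defaultdict(lambda: [0, 0])       # key a_{-i} -> [losers, winners]
    for a, label in labelled.items():
        groups[a[:i] + a[i + 1:]][label] += 1
    return sum(2 * losers * winners for losers, winners in groups.values())

# (0, 1) is never observed, so feature 1 has no observed pair with differing labels.
observed = {(0, 0): 0, (1, 0): 1, (1, 1): 1}
chi_feature0 = chi_dataset(observed, 0)
chi_feature1 = chi_dataset(observed, 1)
```

In this toy dataset feature 1 receives zero influence simply because the point $(0, 1)$ was never observed, which is the kind of gap that a notion of non-immediate influence would have to address.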

Finally, our experimental results, while encouraging, are illustrative rather than 
informative: they tell us that Google’s display ads algorithm is clever enough to assign 
Spanish ads to Spanish speakers. Our experimental results enumerate the number of
displayed ads; this is not necessarily indicative of users’ clickthrough rates. Since our
users are virtual entities, we are not able to measure their clickthrough rates; a broader 
experiment, where user profiles correspond to actual human subjects, would provide 
better insights into the effects user profiling has on display advertising. 

Appendix: Influence in Classification


A Proof of Theorem 3.1 


We define $Piv_i(b) = \{x \in [0,1]^n \mid v(x) = 0,\ v(x_{-i}, b) = 1\}$ to be the set of all pivotal
vectors (w.r.t. $b$), and $A\text{-}Piv_i(b) = \{x \in [0,1]^n \mid v(x) = 1,\ v(x_{-i}, b) = 0\}$ to be
the set of all anti-pivotal vectors. We write $Piv_i = \{(x, b) \in [0,1]^{n+1} \mid x \in Piv_i(b)\}$
and $A\text{-}Piv_i = \{(x, b) \in [0,1]^{n+1} \mid x \in A\text{-}Piv_i(b)\}$. We note that $Vol(Piv_i) = Vol(A\text{-}Piv_i)$:
given a point $(x, b) \in Piv_i$, we know that $v(x) = 0$ but $v(x_{-i}, b) = 1$. Therefore, the
point $((x_{-i}, b), x_i)$ is in $A\text{-}Piv_i$, and this map merely swaps two coordinates. We conclude that

\[
\begin{aligned}
\chi_i &= \int_{b=0}^{1} \bigl( |Piv_i(b)| + |A\text{-}Piv_i(b)| \bigr)\, db \\
       &= \int_{b=0}^{1} Vol(Piv_i(b))\, db + \int_{b=0}^{1} Vol(A\text{-}Piv_i(b))\, db \\
       &= Vol(Piv_i) + Vol(A\text{-}Piv_i) = 2\,Vol(Piv_i).
\end{aligned}
\]


Thus, to show that $\chi_i \ge \chi_j$ whenever $w_i > w_j > 0$, it suffices to show that
$Vol(Piv_i) \ge Vol(Piv_j)$.

We begin by stating a few technical lemmas. Our objective is to establish some volume-preserving
transformations between vectors for which $j$ is pivotal, and vectors for
which $i$ is pivotal.

Lemma A.1. Suppose that $w_i > w_j > 0$; if $x \in Piv_j(b) \setminus Piv_i(b)$ then $x_i > x_j$.

Proof. First, note that if $v(x_{-j}, b) = 1$ but $v(x) = 0$, then $x_j < b$. Now, suppose that
$x_i \le x_j$; we show that $(x_{-j}, b) \cdot w \le (x_{-i}, b) \cdot w$. Indeed,

\[
\begin{aligned}
(x_{-j}, b) \cdot w &\le (x_{-i}, b) \cdot w \\
\iff\quad x_i w_i + b w_j &\le x_j w_j + b w_i \\
\iff\quad x_i w_i - x_j w_j &\le b (w_i - w_j).
\end{aligned}
\]

Thus, we just need to show that $x_i w_i - x_j w_j \le b(w_i - w_j)$. Since $x_i \le x_j$, $x_i w_i - x_j w_j \le x_j (w_i - w_j)$,
and since $w_i > w_j$ and $x_j < b$, this is at most $b(w_i - w_j)$, as required.
This means that if $x_i \le x_j$ then $v(x_{-i}, b) = 1$, i.e. $x \in Piv_i(b)$, which concludes the
proof. □


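Let $f_{ij} : \mathbb{R}^n \to \mathbb{R}^n$ be the transformation

\[
f_{ij}(x)_k =
\begin{cases}
x_i & \text{if } k = j \\
x_j & \text{if } k = i \\
x_k & \text{otherwise.}
\end{cases}
\]

Lemma A.2. If $x \in Piv_j(b) \setminus Piv_i(b)$ then $f_{ij}(x) \notin Piv_j(b) \cup A\text{-}Piv_j(b)$.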
Proof. First, note that $(b - x_j)(w_i - w_j) > 0$; this is because $b > x_j$ and $w_i > w_j$.
This implies that $x_j w_i + b w_j < b w_i + x_j w_j$. Now, since $v(x_{-i}, b) = 0$, we know
that $b w_i + x_j w_j < q - \sum_{k \ne i,j} x_k w_k$; therefore, $w \cdot (f_{ij}(x)_{-j}, b) = \sum_{k \ne i,j} x_k w_k + x_j w_i + b w_j < q$,
and $v(f_{ij}(x)_{-j}, b) = 0$. This implies that $f_{ij}(x) \notin Piv_j(b)$.

Now, $(x_i - x_j)(w_i - w_j) > 0$ since $x_i > x_j$ by Lemma A.1. Therefore, $x_j w_i + x_i w_j < x_i w_i + x_j w_j < q - \sum_{k \ne i,j} x_k w_k$,
which implies that $w \cdot f_{ij}(x) < q$, hence
$v(f_{ij}(x)) = 0$. In particular, $f_{ij}(x) \notin A\text{-}Piv_j(b)$. □

Lemma A.3. Suppose $w_i > w_j > 0$ and that $x \in Piv_j(b) \setminus Piv_i(b)$; if $f_{ij}(x) \notin Piv_i(b)$
then $x_i > b > x_j$.

Proof. Suppose that $b \ge x_i > x_j$. We note that $(b - x_i)(w_i - w_j) \ge 0$, which implies
that $b w_i + x_i w_j \ge x_i w_i + b w_j \ge q - \sum_{k \ne i,j} x_k w_k$. Hence, $w \cdot (f_{ij}(x)_{-i}, b) \ge q$,
which implies that $f_{ij}(x) \in Piv_i(b)$. Thus, if $f_{ij}(x) \notin Piv_i(b)$, it must be the case
that $x_i > b > x_j$. □

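Given some $x \in [0,1]^n$ and some $b \in [0,1]$, we define $g_{ij} : [0,1]^n \times [0,1] \to [0,1]^n$
as follows:

\[
g_{ij}(x, b)_k =
\begin{cases}
x_j & \text{if } k = i \\
b & \text{if } k = j \\
x_k & \text{otherwise.}
\end{cases}
\]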
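
The two coordinate maps can be written out concretely. The sketch below uses hypothetical helper names (`f_swap`, `g_map`, `with_state`) and checks, on one example point, the two identities relating $g_{ij}$ to $f_{ij}$ that the proof of Lemma A.4 relies on; indices are zero-based here, unlike in the text.

```python
def with_state(x, k, value):
    """Return x with coordinate k replaced by value, i.e. (x_{-k}, value)."""
    return x[:k] + (value,) + x[k + 1:]

def f_swap(x, i, j):
    """f_ij: swap coordinates i and j of x."""
    y = list(x)
    y[i], y[j] = x[j], x[i]
    return tuple(y)

def g_map(x, b, i, j):
    """g_ij: put x_j in coordinate i and b in coordinate j."""
    y = list(x)
    y[i], y[j] = x[j], b
    return tuple(y)

x, b, i, j = (0.7, 0.2, 0.5), 0.4, 0, 1
g = g_map(x, b, i, j)
# Setting coordinate i of g back to x_i yields (x_{-j}, b).
identity_one = with_state(g, i, x[i]) == with_state(x, j, b)
# Setting coordinate j of g to x_i yields f_ij(x).
identity_two = with_state(g, j, x[i]) == f_swap(x, i, j)
```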

Lemma A.4. If $x \in Piv_j(b) \setminus Piv_i(b)$ and $f_{ij}(x) \notin Piv_i(b)$, then
$g_{ij}(x, b) \in Piv_i(x_i) \setminus (Piv_j(x_i) \cup A\text{-}Piv_j(x_i))$.

Proof. First, we observe that $(g_{ij}(x, b)_{-i}, x_i) = (x_{-j}, b)$, and that $(g_{ij}(x, b)_{-j}, x_i) = f_{ij}(x)$.
As observed in Lemma A.2, if $x_i > x_j$ then $v(f_{ij}(x)) = 0$. Therefore,
$g_{ij}(x, b) \notin Piv_j(x_i)$. Moreover, since $x \in Piv_j(b)$, $v(g_{ij}(x, b)_{-i}, x_i) = v(x_{-j}, b) = 1$.
On the other hand, $(b - x_j)(w_i - w_j) > 0$, so $x_j w_i + b w_j < b w_i + x_j w_j < q - \sum_{k \ne i,j} x_k w_k$,
so $g_{ij}(x, b) \cdot w < q$, i.e. $v(g_{ij}(x, b)) = 0$. This means that $g_{ij}(x, b) \in Piv_i(x_i)$ and that $g_{ij}(x, b) \notin A\text{-}Piv_j(x_i)$. □

Given a set $S \subseteq \mathbb{R}^m$ and a function $f : \mathbb{R}^m \to \mathbb{R}^m$, we define $f(S) = \{f(s) \mid s \in S\}$.
We can extend $f_{ij}$ and $g_{ij}$ defined above to functions from $\mathbb{R}^{n+1}$ to $\mathbb{R}^{n+1}$
as follows. Given a point $(x, b) \in \mathbb{R}^{n+1}$, we define $F_{ij}(x, b) = (f_{ij}(x), b)$, and
$G_{ij}(x, b) = (g_{ij}(x, b), x_i)$. We note that both $F_{ij}$ and $G_{ij}$ merely permute coordinates in
their inputs, thus they preserve distances:

\[
d(G_{ij}(x, b), G_{ij}(y, c)) = d((x, b), (y, c))
\]

where $d$ is the Euclidean metric. Isometries are known to preserve volume: if $I : \mathbb{R}^m \to \mathbb{R}^m$
is an isometry, then $Vol(S) = Vol(I(S))$ for any $S \subseteq \mathbb{R}^m$.

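Theorem A.5. If $w_i > w_j > 0$ then $Vol(Piv_j) \le Vol(Piv_i)$.

Proof. We partition $Piv_j$ as follows. We denote

\[
\begin{aligned}
A_{ij} &= Piv_j \cap Piv_i, \\
B_{ij} &= \{(x, b) \in Piv_j \setminus Piv_i \mid (f_{ij}(x), b) \in Piv_i\}, \text{ and} \\
C_{ij} &= \{(x, b) \in Piv_j \setminus Piv_i \mid (f_{ij}(x), b) \notin Piv_i\}.
\end{aligned}
\]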
Clearly, $A_{ij}$, $B_{ij}$ and $C_{ij}$ partition $Piv_j$.

According to Lemma A.2, $F_{ij}(B_{ij}) \subseteq Piv_i \setminus Piv_j$. Now, let us observe $C_{ij}$. According
to Lemma A.4, $G_{ij}(C_{ij}) \subseteq Piv_i \setminus Piv_j$. It remains to show that $F_{ij}(B_{ij}) \cap G_{ij}(C_{ij}) = \emptyset$.
Suppose that there are some $(x, b) \in B_{ij}$, $(z, c) \in C_{ij}$ such that
$(f_{ij}(x), b) = (g_{ij}(z, c), z_i)$. This means that $(z, c) = ((x_{-i}, b), x_i)$. To prove a contradiction,
it suffices to show that if $(x, b) \in B_{ij}$ then we have that $((x_{-i}, b), x_i) \notin C_{ij}$.
In order to be in $C_{ij}$, it must be the case that $f_{ij}(x_{-i}, b) \notin Piv_i(x_i)$; we show that
$f_{ij}(x_{-i}, b) \in Piv_i(x_i)$. First, let us write $f_{ij}(x_{-i}, b) = y$. We note that $y_k = x_k$
for all $k \ne i, j$, that $y_j = b$, and that $y_i = x_j$. Since $b > x_j$, it must be the
case that $(b - x_j)(w_i - w_j) > 0$, hence $b w_i + x_j w_j > x_j w_i + b w_j$. Therefore,
$w \cdot y < w \cdot (x_{-i}, b)$. Now, since $(x, b) \in Piv_j \setminus Piv_i$, it must be the case that
$v(x_{-i}, b) = 0$, i.e. that $w \cdot (x_{-i}, b) < q$. This means that $v(y) = 0$. We now show that
$v(y_{-i}, x_i) = 1$. Since $y_i = x_j$ and $y_j = b$, $(y_{-i}, x_i) = (x_{-j}, b)$. Since $(x, b) \in Piv_j$,
$v(y_{-i}, x_i) = v(x_{-j}, b) = 1$. Therefore, $y \in Piv_i(x_i)$, and thus $((x_{-i}, b), x_i) \notin C_{ij}$. We
conclude that indeed $F_{ij}(B_{ij}) \cap G_{ij}(C_{ij}) = \emptyset$.

To conclude,

\[
\begin{aligned}
Vol(Piv_j) &= Vol(A_{ij}) + Vol(B_{ij}) + Vol(C_{ij}) \\
           &= Vol(A_{ij}) + Vol(F_{ij}(B_{ij})) + Vol(G_{ij}(C_{ij})) \\
           &\le Vol(Piv_i),
\end{aligned}
\]

which concludes the proof. □

Corollary A.6. Let $\mathcal{G} = (N, [0,1]^n, v)$ be a game where $v$ is a linear separator given
by $w$ and $q$. If $w_i > w_j > 0$ then $\chi_i(\mathcal{G}) \ge \chi_j(\mathcal{G})$.

Corollary A.6 shows that $\chi$ is monotone in feature weights; a complementary result
shows that increasing a feature’s weight would result in an increase in influence. Next,
we show that Corollary A.6 holds even when weights are negative.
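
As an informal sanity check of the monotonicity claim (not a substitute for the proof), one can approximate each feature's influence for a small linear separator on a midpoint grid. The sketch assumes the classifier outputs 1 exactly when $w \cdot x \ge q$; the weights, the threshold, the grid resolution and the function name are arbitrary illustrative choices.

```python
from itertools import product

def chi_linear_grid(w, q, i, m=10):
    """Approximate the fraction of (x, b) in [0,1]^n x [0,1] for which
    replacing x_i by b changes the output of the separator w . x >= q."""
    n = len(w)
    grid = [(k + 0.5) / m for k in range(m)]        # midpoints of m equal cells
    v = lambda x: int(sum(wk * xk for wk, xk in zip(w, x)) >= q)
    flips = 0
    for x in product(grid, repeat=n):
        vx = v(x)
        for b in grid:
            flips += abs(v(x[:i] + (b,) + x[i + 1:]) - vx)
    return flips / m ** (n + 1)

# The threshold is chosen off-grid so that no grid point lies exactly on the separator.
w, q = (0.9, 0.5, 0.2), 0.801
chi = [chi_linear_grid(w, q, i) for i in range(3)]
```

With these positive weights in decreasing order, the approximate influences come out in the same order, as Corollary A.6 predicts.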

Lemma A.7. Let $\mathcal{G} = (\{1, 2\}, [0,1]^2, v)$ be a 2-feature linear separator with $w_1 > 0$
and $w_2 < 0$. Then $\chi_1(\mathcal{G}) \ge \chi_2(\mathcal{G})$ if and only if $|w_1| \ge |w_2|$.

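Proof. We begin by assuming that $q > 0$. First, suppose that $w_1 < q$. In that case, for
all $(x_1, x_2) \in [0,1]^2$, we have $x_1 w_1 + x_2 w_2 \le x_1 w_1 \le w_1 < q$, so $v(x_1, x_2) = 0$ for
all $(x_1, x_2) \in [0,1]^2$. In particular, $\chi_1(\mathcal{G}) = \chi_2(\mathcal{G}) = 0$ and we are done.

We now assume that $w_1 \ge q$. We show that the claim holds by direct computation
of $\chi_1, \chi_2$. We start by computing $\chi_1(\mathcal{G})$. By definition, $\chi_1(\mathcal{G})$ equals

\[
\int_0^1 \left( \int_0^1 I(v(x_1, x_2) = 1)\, dx_1 \int_0^1 I(v(y_1, x_2) = 0)\, dy_1 \right) dx_2,
\]

which equals

\[
\int_0^1 \left( \int_0^1 I\!\left(x_1 \ge \frac{q - x_2 w_2}{w_1}\right) dx_1 \int_0^1 I\!\left(y_1 < \frac{q - x_2 w_2}{w_1}\right) dy_1 \right) dx_2. \tag{7}
\]

The product of the internal integrals in (7) is zero whenever $\frac{q - x_2 w_2}{w_1} \notin [0,1]$. We know that $\frac{q - x_2 w_2}{w_1} > 0$
for all $x_2 \in [0,1]$; however, $\frac{q - x_2 w_2}{w_1} \le 1$ only when $x_2 \le \frac{w_1 - q}{-w_2}$. This inequality is
non-trivial only if $\frac{w_1 - q}{-w_2} \le 1$. This happens only when $q \ge w_1 + w_2$. Therefore, we
distinguish between two cases; the first case is when $q \ge w_1 + w_2$, and the second is
when $q < w_1 + w_2$. In the second case, since $q > 0$, $w_1 + w_2 > 0$ as well, hence
$|w_1| > |w_2|$. In the first case we have:

\[
\chi_1(\mathcal{G}) = \int_0^{\frac{w_1 - q}{-w_2}} \left(1 - \frac{q - x_2 w_2}{w_1}\right)\left(\frac{q - x_2 w_2}{w_1}\right) dx_2
= \frac{(w_1 - q)^2 (2q + w_1)}{6(-w_2) w_1^2}. \tag{8}
\]

In the second case we have

\[
\chi_1(\mathcal{G}) = \int_0^1 \left(1 - \frac{q - x_2 w_2}{w_1}\right)\left(\frac{q - x_2 w_2}{w_1}\right) dx_2
= \frac{6q(w_1 + w_2) - 6q^2 - w_2(3w_1 + 2w_2)}{6 w_1^2}. \tag{9}
\]

Now, let us proceed to compute $\chi_2(\mathcal{G})$. We have that $\chi_2(\mathcal{G})$ equals

\[
\int_0^1 \left( \int_0^1 I(v(x_1, x_2) = 1)\, dx_2 \int_0^1 I(v(x_1, y_2) = 0)\, dy_2 \right) dx_1,
\]

which equals

\[
\int_0^1 \left( \int_0^1 I\!\left(x_2 \le \frac{x_1 w_1 - q}{-w_2}\right) dx_2 \int_0^1 I\!\left(y_2 > \frac{x_1 w_1 - q}{-w_2}\right) dy_2 \right) dx_1. \tag{10}
\]

Again, the product of the internal integrals in (10) is not zero only if $\frac{x_1 w_1 - q}{-w_2} \in [0,1]$. $\frac{x_1 w_1 - q}{-w_2} \ge 0$
if and only if $x_1 \ge \frac{q}{w_1}$, and $\frac{x_1 w_1 - q}{-w_2} \le 1$ if and only if $x_1 \le \frac{q - w_2}{w_1}$. This inequality is
non-trivial only if $\frac{q - w_2}{w_1} < 1$, which happens only when $q < w_1 + w_2$. Thus, we again
distinguish between the case when $q \ge w_1 + w_2$ and the case when $q < w_1 + w_2$. In
the first case, we have

\[
\chi_2(\mathcal{G}) = \int_{\frac{q}{w_1}}^{1} \left(\frac{x_1 w_1 - q}{-w_2}\right)\left(1 - \frac{x_1 w_1 - q}{-w_2}\right) dx_1
= \frac{(w_1 - q)^2 (2q - 2w_1 - 3w_2)}{6 w_2^2 w_1}, \tag{11}
\]

and in the second case, $\chi_2(\mathcal{G})$ equals

\[
\int_{\frac{q}{w_1}}^{\frac{q - w_2}{w_1}} \left(\frac{x_1 w_1 - q}{-w_2}\right)\left(1 - \frac{x_1 w_1 - q}{-w_2}\right) dx_1
= \frac{-w_2}{6 w_1}. \tag{12}
\]

Let us compare the values when $q \ge w_1 + w_2$.

\[
\begin{aligned}
\chi_1(\mathcal{G}) &\ge \chi_2(\mathcal{G}) \\
\iff\quad \frac{(w_1 - q)^2 (2q + w_1)}{6(-w_2) w_1^2} &\ge \frac{(w_1 - q)^2 (2q - 2w_1 - 3w_2)}{6 w_2^2 w_1} \\
\iff\quad \frac{2q + w_1}{w_1} &\ge \frac{2q - 2w_1 - 3w_2}{-w_2} \\
\iff\quad (-w_2)(2q + w_1) &\ge w_1 (2q - 2w_1 - 3w_2) \\
\iff\quad w_1 (w_1 + w_2) &\ge q (w_1 + w_2)
\end{aligned}
\tag{13}
\]

Thus, (13) holds with equality if $w_1 = -w_2$, $\chi_1(\mathcal{G}) > \chi_2(\mathcal{G})$ if $w_1 > -w_2$ (since
$w_1 > q > 0$ by assumption), and $\chi_1(\mathcal{G}) < \chi_2(\mathcal{G})$ otherwise. For the second case, we
have

\[
\begin{aligned}
\chi_1(\mathcal{G}) &\ge \chi_2(\mathcal{G}) \\
\iff\quad \frac{6q(w_1 + w_2) - 6q^2 - w_2(3w_1 + 2w_2)}{6 w_1^2} &\ge \frac{-w_2}{6 w_1} \\
\iff\quad \frac{6q(w_1 + w_2) - 6q^2 - w_2(3w_1 + 2w_2)}{w_1} &\ge -w_2 \\
\iff\quad 6q(w_1 + w_2) - 6q^2 - w_2(3w_1 + 2w_2) &\ge (-w_2) w_1 \\
\iff\quad 6q(w_1 + w_2) - 6q^2 - 2w_2(w_1 + w_2) &\ge 0 \\
\iff\quad (3q - w_2)(w_1 + w_2) &\ge 3q^2
\end{aligned}
\tag{14}
\]

Now, (14) holds with equality if $w_1 + w_2 = 0$, since then $q = 0$ as well. Finally, if
$w_1 + w_2 > 0$, then it holds with strict inequality since $w_1 + w_2 > q$ and $3q - w_2 > 3q$,
and we are done.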

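Next, let us assume that $q < 0$. We again directly compute $\chi_1(\mathcal{G})$ and $\chi_2(\mathcal{G})$. First,
if $w_2 \ge q$, then $x_1 w_1 + x_2 w_2 \ge x_2 w_2 \ge w_2 \ge q$ for all $(x_1, x_2) \in [0,1]^2$; hence
$\chi_1(\mathcal{G}) = \chi_2(\mathcal{G}) = 0$, and the claim trivially holds. We now assume that $w_2 < q$.
Again, we have that $\chi_1(\mathcal{G})$ equals

\[
\int_0^1 \left( \int_0^1 I\!\left(x_1 < \frac{q - x_2 w_2}{w_1}\right) dx_1 \int_0^1 I\!\left(y_1 \ge \frac{q - x_2 w_2}{w_1}\right) dy_1 \right) dx_2. \tag{15}
\]

We need to have $\frac{q - x_2 w_2}{w_1} \in [0,1]$. $\frac{q - x_2 w_2}{w_1} \ge 0$ if and only if $x_2 \ge \frac{q}{w_2}$. Since $w_2 < q$,
this value is always less than 1. Moreover, $\frac{q - x_2 w_2}{w_1} \le 1$ if and only if $x_2 \le \frac{w_1 - q}{-w_2}$.
This inequality is not trivial only if $\frac{w_1 - q}{-w_2} < 1$, which happens whenever $q > w_2 + w_1$.
Thus, when $q > w_1 + w_2$, $\chi_1(\mathcal{G})$ equals

\[
\int_{\frac{q}{w_2}}^{\frac{w_1 - q}{-w_2}} \left(\frac{q - x_2 w_2}{w_1}\right)\left(1 - \frac{q - x_2 w_2}{w_1}\right) dx_2 = \frac{w_1}{-6 w_2},
\]

and when $q \le w_1 + w_2$, $\chi_1(\mathcal{G})$ equals

\[
\int_{\frac{q}{w_2}}^{1} \left(\frac{q - x_2 w_2}{w_1}\right)\left(1 - \frac{q - x_2 w_2}{w_1}\right) dx_2
= \frac{(q - w_2)^2 (2q - 2w_2 - 3w_1)}{6 w_2 w_1^2}.
\]

For $\chi_2(\mathcal{G})$, we employ a similar reasoning. First, $\chi_2(\mathcal{G})$ equals

\[
\int_0^1 \left( \int_0^1 I\!\left(x_2 \le \frac{x_1 w_1 - q}{-w_2}\right) dx_2 \int_0^1 I\!\left(y_2 > \frac{x_1 w_1 - q}{-w_2}\right) dy_2 \right) dx_1. \tag{16}
\]

And again, $\frac{x_1 w_1 - q}{-w_2} \in [0,1]$ if and only if $x_1 \le \frac{q - w_2}{w_1}$. Note that since $w_2 < q$,
$\frac{q - w_2}{w_1} > 0$. This constraint is only meaningful when $q < w_1 + w_2$. Thus, when
$q \ge w_1 + w_2$, we have that $\chi_2(\mathcal{G})$ equals

\[
\int_0^1 \left(\frac{x_1 w_1 - q}{-w_2}\right)\left(1 - \frac{x_1 w_1 - q}{-w_2}\right) dx_1
= \frac{-6q^2 + 6q(w_1 + w_2) - w_1(3w_2 + 2w_1)}{6 w_2^2},
\]

and equals

\[
\int_0^{\frac{q - w_2}{w_1}} \left(\frac{x_1 w_1 - q}{-w_2}\right)\left(1 - \frac{x_1 w_1 - q}{-w_2}\right) dx_1
= -\frac{(q - w_2)^2 (2q + w_2)}{6 w_2^2 w_1}
\]

otherwise.

Next, we compare the values we obtained. When $q > w_1 + w_2$, we have that
$w_1 + w_2 < 0$, and in particular, $|w_2| > |w_1|$. Moreover,

\[
\begin{aligned}
\chi_2(\mathcal{G}) &\ge \chi_1(\mathcal{G}) \\
\iff\quad \frac{-6q^2 + 6q(w_1 + w_2) - w_1(3w_2 + 2w_1)}{6 w_2^2} &\ge \frac{w_1}{-6 w_2} \\
\iff\quad \frac{-6q^2 + 6q(w_1 + w_2) - w_1(3w_2 + 2w_1)}{-w_2} &\ge w_1 \\
\iff\quad -6q^2 + 6q(w_1 + w_2) - w_1(2w_2 + 2w_1) &\ge 0 \\
\iff\quad (3q - w_1)(w_2 + w_1) &\ge 3q^2
\end{aligned}
\]

Under our assumptions, this inequality holds, and we are done with the first case. For
the second case,

\[
\begin{aligned}
\chi_1(\mathcal{G}) &\ge \chi_2(\mathcal{G}) \\
\iff\quad \frac{(q - w_2)^2 (2q - 2w_2 - 3w_1)}{6 w_2 w_1^2} &\ge -\frac{(q - w_2)^2 (2q + w_2)}{6 w_2^2 w_1} \\
\iff\quad \frac{2w_2 + 3w_1 - 2q}{w_1} &\ge \frac{-2q - w_2}{-w_2} \\
\iff\quad (-w_2)(2w_2 + 3w_1 - 2q) &\ge w_1 (-2q - w_2) \\
\iff\quad (-w_2)(w_1 + w_2) &\ge (-q)(w_1 + w_2)
\end{aligned}
\]

Since $w_2 < q$, this inequality holds with equality when $w_1 = -w_2$, it is strict whenever
$|w_1| > |w_2|$, and the reverse holds when $|w_1| < |w_2|$. □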
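
The closed forms in the proof of Lemma A.7 can be checked numerically on a single instance. The sketch below covers only the case $q > 0$, $q \ge w_1 + w_2$, and evaluates the one-dimensional integrals with a midpoint rule; the function names, the chosen weights and the step count are illustrative assumptions.

```python
def clip01(t):
    return min(max(t, 0.0), 1.0)

def chi1_numeric(w1, w2, q, steps=20000):
    """Midpoint rule for the integral over x2 of t * (1 - t), where
    t = (q - x2*w2)/w1 is the threshold on x1, clipped to [0, 1]."""
    total = 0.0
    for k in range(steps):
        x2 = (k + 0.5) / steps
        t = clip01((q - x2 * w2) / w1)
        total += (1.0 - t) * t
    return total / steps

def chi2_numeric(w1, w2, q, steps=20000):
    """Same for feature 2, with threshold s = (x1*w1 - q)/(-w2) on x2."""
    total = 0.0
    for k in range(steps):
        x1 = (k + 0.5) / steps
        s = clip01((x1 * w1 - q) / (-w2))
        total += s * (1.0 - s)
    return total / steps

def chi1_closed(w1, w2, q):
    return (w1 - q) ** 2 * (2 * q + w1) / (6 * (-w2) * w1 ** 2)

def chi2_closed(w1, w2, q):
    return (w1 - q) ** 2 * (2 * q - 2 * w1 - 3 * w2) / (6 * w2 ** 2 * w1)

w1, w2, q = 1.0, -2.0, 0.5      # q > 0 and q >= w1 + w2
gap1 = abs(chi1_numeric(w1, w2, q) - chi1_closed(w1, w2, q))
gap2 = abs(chi2_numeric(w1, w2, q) - chi2_closed(w1, w2, q))
```

On this instance $|w_2| > |w_1|$, and the closed forms give $\chi_2 > \chi_1$, in line with the statement of the lemma.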
We are now ready to complete the proof of Theorem 3.1.

Proof of Theorem 3.1. We have shown the case where $w_i > w_j > 0$ in Theorem A.5.
We have also shown this to be true for two features in Lemma A.7. We just need to show
that Lemma A.7 extends to the case of arbitrary players. Suppose that $|w_i| \ge |w_j|$. Let
us write $\chi_i((N, w; q))$ to be the influence of $i$ under the linear classifier defined by $w$
and $q$. We observe that

\[
\begin{aligned}
\chi_i((N, w; q)) &= \int_{x_{-\{i,j\}}} \chi_i\Bigl(\bigl(\{i, j\}, (w_i, w_j); q - \sum_{k \ne i,j} x_k w_k\bigr)\Bigr)\, dx_{-\{i,j\}} \\
&\ge \int_{x_{-\{i,j\}}} \chi_j\Bigl(\bigl(\{i, j\}, (w_i, w_j); q - \sum_{k \ne i,j} x_k w_k\bigr)\Bigr)\, dx_{-\{i,j\}} \\
&= \chi_j((N, w; q)),
\end{aligned}
\]

which concludes the proof. □


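B Proof that $\chi$ satisfies (D), (Sym) and (DU)

We show that $\chi$ satisfies the three axioms. If $v(a_{-i}, b) = v(a)$ for all $a \in A$ and all
$b \in A_i$, then $|v(a_{-i}, b) - v(a)| = 0$, and in particular, $\chi_i(\mathcal{G}) = 0$; hence, $\chi$ satisfies
the dummy property. Suppose we are given a bijection $\sigma_i : A_i \to A_i$. We observe that

\[
\begin{aligned}
\chi_i(\mathcal{G}) &= \frac{1}{|A|} \sum_{a \in A} \sum_{b \in A_i} |v(a_{-i}, b) - v(a)| \\
&= \frac{1}{|A|} \sum_{a_{-i} \in A_{-i}} \sum_{b' \in A_i} \sum_{b \in A_i} |v(a_{-i}, \sigma_i(b)) - v(a_{-i}, \sigma_i(b'))| \\
&= \frac{1}{|A|} \sum_{a_{-i} \in A_{-i}} \sum_{b' \in A_i} \sum_{b \in A_i} |v_{\sigma_i}(a_{-i}, b) - v_{\sigma_i}(a_{-i}, b')| \\
&= \frac{1}{|A|} \sum_{a \in A} \sum_{b \in A_i} |v_{\sigma_i}(a_{-i}, b) - v_{\sigma_i}(a)| = \chi_i(\sigma_i \mathcal{G}),
\end{aligned}
\]

so $\chi$ is invariant under permutations of feature states. Similarly, for any bijection $\sigma : N \to N$,
$\chi_i(\mathcal{G}) = \chi_{\sigma(i)}(\sigma \mathcal{G})$; therefore, $\chi$ satisfies symmetry.

Given a set $B \subseteq A$ and a feature $i$, let us write $W_{a_{-i}}(B) = \{a' \in B \mid v(a') = 1,\ a'_{-i} = a_{-i}\}$,
and $L_{a_{-i}}(B) = \{a' \in B \mid v(a') = 0,\ a'_{-i} = a_{-i}\}$. We observe
that $W_{a_{-i}}(B) \cap W_{a'_{-i}}(B) = L_{a_{-i}}(B) \cap L_{a'_{-i}}(B) = \emptyset$ whenever $a_{-i} \ne a'_{-i}$; moreover,
$L(B) = \bigcup_{a_{-i} \in A_{-i}} L_{a_{-i}}(B)$ and $W(B) = \bigcup_{a_{-i} \in A_{-i}} W_{a_{-i}}(B)$. Now, given some $B \subseteq A$, let
us take some $W' \subseteq W(A) \setminus W(B)$.

\[
\begin{aligned}
\chi_i(W(B), L(B)) &= \sum_{a \in B} \sum_{\substack{b \in A_i: \\ (a_{-i}, b) \in B}} |v(a_{-i}, b) - v(a)| \\
&= \sum_{a \in W(B)} \sum_{\substack{b \in A_i: \\ (a_{-i}, b) \in B}} |v(a_{-i}, b) - v(a)|
 + \sum_{a \in L(B)} \sum_{\substack{b \in A_i: \\ (a_{-i}, b) \in B}} |v(a_{-i}, b) - v(a)|
\end{aligned}
\]

Next, we observe that the first summand equals

\[
\sum_{a \in W(B)} \sum_{\substack{b \in A_i: \\ (a_{-i}, b) \in B}} v(a) - v(a_{-i}, b),
\]

which equals

\[
\sum_{a_{-i} \in A_{-i}} \sum_{a \in W_{a_{-i}}(B)} \sum_{\substack{b \in A_i: \\ (a_{-i}, b) \in B}} v(a) - v(a_{-i}, b). \tag{17}
\]

Now, $v(a) - v(a_{-i}, b) = 1$ if and only if $v(a_{-i}, b) = 0$; that is, if $(a_{-i}, b) \in L_{a_{-i}}(B)$.
Thus, Equation (17) equals

\[
\sum_{a_{-i} \in A_{-i}} \sum_{a \in W_{a_{-i}}(B)} |L_{a_{-i}}(B)| = \sum_{a_{-i} \in A_{-i}} |W_{a_{-i}}(B)|\,|L_{a_{-i}}(B)|. \tag{18}
\]

A similar construction with $W'$ shows that

\[
\chi_i(W', L(B)) = \sum_{a_{-i} \in A_{-i}} |W'_{a_{-i}}|\,|L_{a_{-i}}(B)|;
\]

since $W(B)$ and $W'$ are disjoint, $\chi$ satisfies the disjoint union property.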

C Relation to Classic Values in TU Cooperative Games 

Our work generalizes influence measurement in classic TU cooperative games. We 
recall that a cooperative game with transferable utility is given by a set of players
$N = \{1, \ldots, n\}$, and a function $v : 2^N \to \mathbb{R}$, called the characteristic function. A
game is defined by the tuple $\mathcal{G} = (N, v)$. We say that a game $\mathcal{G}$ is monotone if for all
$S \subseteq T \subseteq N$, $v(S) \le v(T)$.

Classic literature identifies two canonical methods of measuring feature influence in 
cooperative games, the Shapley value [Shapley, 1953], and the Banzhaf value [Banzhaf, 
1965]. We begin by providing the following definitions. Given a set $S \subseteq N$ and a
player $i$, we let $m_i(S) = v(S \cup \{i\}) - v(S)$ denote the marginal contribution of $i$ to
$S$. The value $m_i(S)$ simply describes the added benefit of having $i$ join the coalition
$S$. Let $\Pi(N)$ be the set of all bijections from $N$ to itself (also called the set of permutations
of $N$); given some $\sigma \in \Pi(N)$ we let $P_i(\sigma) = \{j \in N \mid \sigma(j) < \sigma(i)\}$ be the
set of the predecessors of $i$ under $\sigma$. We define $m_i(\sigma) = v(P_i(\sigma) \cup \{i\}) - v(P_i(\sigma))$.

Definition C.1. The Banzhaf value of a player $i \in N$ is given by

\[
\beta_i(\mathcal{G}) = \frac{1}{2^n} \sum_{S \subseteq N} m_i(S).
\]

The Banzhaf value takes on a simple probabilistic interpretation; if we choose a set 
S uniformly at random from N, the Banzhaf value of a player is his expected marginal 
contribution to that set. 

Rather than uniformly sampling sets, the Shapley value is based on uniformly sam¬ 
pling permutations. 

Definition C.2. The Shapley value of a player $i \in N$ is given by

\[
\varphi_i(\mathcal{G}) = \frac{1}{n!} \sum_{\sigma \in \Pi(N)} v(P_i(\sigma) \cup \{i\}) - v(P_i(\sigma)).
\]

Intuitively, one can think of the Shapley value as the result of the following process. 
We randomly pick some order of the players; each player receives a payoff that is equal 
to his marginal contribution to his predecessors in the ordering. The Shapley value is 
simply the expected payoff a player receives in this scheme. 

When we sample sets uniformly at random from $N \setminus \{i\}$, we are heavily biased
towards selecting sets whose size is approximately $n/2$. When measuring influence
according to the Shapley value, we are no longer biased towards any set size. One can
think of the Shapley value as measuring a player’s expected marginal contribution to
a set $S$, where $S$ is chosen according to the following process. First, we pick some
$k \in \{0, \ldots, n - 1\}$ uniformly at random, and then we pick a set of size $k$ uniformly at
random.
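
For small games both values can be computed by brute force, which makes the two sampling interpretations concrete. The sketch below averages marginal contributions over all subsets of $N$ for the Banzhaf value (one reading of the normalization, matching the "uniformly random set" interpretation above) and over all orderings for the Shapley value; the three-player majority game and the function names are illustrative assumptions.

```python
from itertools import combinations, permutations
from math import factorial

def banzhaf(v, n, i):
    """Average marginal contribution of i over all 2^n subsets S of N."""
    total = 0
    for size in range(n + 1):
        for members in combinations(range(n), size):
            S = frozenset(members)
            total += v(S | {i}) - v(S)      # zero whenever i is already in S
    return total / 2 ** n

def shapley(v, n, i):
    """Average marginal contribution of i to its predecessors over all orderings."""
    total = 0
    for order in permutations(range(n)):
        predecessors = frozenset(order[:order.index(i)])
        total += v(predecessors | {i}) - v(predecessors)
    return total / factorial(n)

# Toy game (assumed): a coalition wins iff it has at least two of three players.
majority = lambda S: int(len(S) >= 2)
banzhaf_0 = banzhaf(majority, 3, 0)
shapley_0 = shapley(majority, 3, 0)
```

Both routines take time exponential in $n$ and are meant only to illustrate the definitions; in practice one would sample sets or permutations instead.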

We observe that our classification setting is a generalization of TU cooperative 
games. Think of each player as a feature that can take on two values: 0 (corresponding 
to “absent”), and 1 (corresponding to “present”). An immediate observation is that
$\zeta$ coincides with the Banzhaf value for TU cooperative games. Is there some natural
extension of the Shapley value for general classification tasks?

Our work provides a negative answer to this question. We observe that Theorem D.1
states that the only value that satisfies the dummy, symmetry and linearity axioms
is $\zeta$. When reduced to the cooperative game setting, we obtain axioms that were
used to axiomatically characterize both the Shapley and the Banzhaf values [Lehrer,
1988; Shapley, 1953; Young, 1985].

The dummy axiom (Definition 2.2) reduces to the following: a player $i \in N$ is a
dummy if for all $S \subseteq N$, $v(S \cup \{i\}) = v(S)$. Thus, the dummy axiom requires that if
a player is a dummy, then his value should be zero.

The symmetry axiom (Definition 2.1) reduces to the following: given a game $\mathcal{G} = (N, v)$,
and some $i, j \in N$, let us define $\mathcal{G}' = (N, v')$ as follows: for all $S \subseteq N \setminus \{i, j\}$,
$v'(S) = v(S)$, and $v'(S \cup \{i, j\}) = v(S \cup \{i, j\})$; however, $v'(S \cup \{i\}) = v(S \cup \{j\})$
and $v'(S \cup \{j\}) = v(S \cup \{i\})$. A value $\phi$ satisfies symmetry if $\phi_i(\mathcal{G}) = \phi_j(\mathcal{G}')$.

Symmetry reduces to saying that if we replace $v(S)$ with $v(S \setminus \{i\})$ for all $S$ such that
$i \in S$, and replace $v(S)$ with $v(S \cup \{i\})$ for all $S$ such that $i \notin S$, then the total influence
of a player (i.e. his influence when being absent plus his influence when present) does
not change.

Additivity as defined in Definition 2.3 is also naturally applied to TU cooperative 
games and is equivalent to the definition given in other axiomatic treatments of values 
in cooperative games. 

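It is well-known that both the Banzhaf and Shapley values satisfy the dummy, symmetry
and additivity axioms, and indeed, Proposition 2.5 applies to them both: the
Banzhaf value (and Shapley) of a player only measures the effect of player $i$ joining
a coalition, but not the effect of him leaving it. These two values, however, sum to 0.
Indeed, writing $\beta_{i,1}$ for the effect of $i$ joining and $\beta_{i,0}$ for the effect of $i$ leaving,

\[
\begin{aligned}
\beta_{i,1}(\mathcal{G}) + \beta_{i,0}(\mathcal{G})
&= \frac{1}{2^n} \sum_{S \subseteq N} \bigl( v(S \cup \{i\}) - v(S) \bigr)
 + \frac{1}{2^n} \sum_{S \subseteq N} \bigl( v(S \setminus \{i\}) - v(S) \bigr) \\
&= \frac{1}{2^n} \sum_{S \subseteq N \setminus \{i\}} \bigl( v(S \cup \{i\}) - v(S) \bigr)
 + \frac{1}{2^n} \sum_{S \subseteq N \setminus \{i\}} \bigl( v(S) - v(S \cup \{i\}) \bigr) \\
&= 0.
\end{aligned}
\]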

Theorem 2.8 characterizes $\chi$ as the unique value to satisfy the dummy, symmetry and
disjoint union properties.

Going back to the classification setting, it is easy to see that Definition 2.6 implies
that for $C \subseteq A$ and any two sets $B, B' \subseteq A \setminus C$, $\phi_i(B, C) + \phi_i(B', C) = \phi_i(B \cup B', C) + \phi_i(B \cap B', C)$.

One can directly interpret the DU property in TU cooperative games. Given a game
$\mathcal{G} = (N, v)$ and a subset $B$ of $2^N$, both the Shapley and Banzhaf values can be defined
to ignore any elements that are not contained in $B$. It is easy to see that Theorem 2.8
implies the uniqueness of $\chi$ for TU cooperative games, and that it equals the Banzhaf
value. Thus, Theorem 2.8 can be seen as an alternative axiomatization of the Banzhaf
value, this time from the binary classification perspective.

D Axiomatic Approach to State Influence 

Section 2 provided an answer to questions of the following form: what is the impact of 
gender on classification. The answer provided in previous sections was that influence 
was a function of the feature’s ability to change outcomes by changing its state. 
It is also useful to ask a related question: suppose that a certain search engine user
is profiled as a female. What is the influence of this profiling decision? In other words,
rather than measuring feature influence, we are measuring the influence of feature $i$
being in a certain state.

For a feature $i \in N$ and a state $b \in A_i$, we can ask what is the influence of the state
$b$, rather than the influence of $i$. That is, rather than having a value $\phi_i(\mathcal{G})$ for a feature
$i \in N$, we now study the influence of the state $b \in A_i$, i.e. a real value $\phi_{i,b}(\mathcal{G})$ for each
$i \in N$ and $b \in A_i$.

While Proposition 2.5 implies that any feature influence measure that satisfies the
dummy, symmetry and additivity axioms must be trivial, this result does not carry
through to measures of state influence.

Dummy (D): given $i \in N$ and $b \in A_i$, we say that $\alpha$ satisfies the dummy property if
whenever $v(a_{-i}, b) = v(a)$ for all $a \in A$, $\alpha_{i,b} = 0$.

Symmetry (Sym): Two states $b, b' \in A_i$ are symmetric if for all $a \in A$, $v(a_{-i}, b) = v(a_{-i}, b')$.
A value $\alpha$ satisfies symmetry if $\alpha_{i,b} = \alpha_{i,b'}$ whenever $b$ and $b'$ are
symmetric.

Linearity (L): Given games $\mathcal{G}_1 = (N, A, v_1)$ and $\mathcal{G}_2 = (N, A, v_2)$, let us write $\mathcal{G} = (N, A, v)$
where $v = v_1 + v_2$. We assume that $v_1$ and $v_2$ are such that $v$ is still
a function with binary values (i.e. if $v_1(a) = 1$ then $v_2(a) = 0$). A value $\alpha$ is
linear if $\alpha_{i,b}(\mathcal{G}) = \alpha_{i,b}(\mathcal{G}_1) + \alpha_{i,b}(\mathcal{G}_2)$.

Let us define 


(i,b(G) = 

We let ( denote the value ( without the normalizing factor We refer to ( as the raw 
version of (. In Theorem D.l, we show that ( is the unique (up to a constant) value 
that satisfies the symmetry, dummy and linearity axioms. 
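
To make the state-influence idea concrete, here is a hedged sketch of one candidate raw form, $\sum_{a \in A} \bigl(v(a_{-i}, b) - v(a)\bigr)$. This particular formula is an assumption: it is chosen because it reproduces the values on unanimity games quoted in the proof of Theorem D.1 ($|A_i| - 1$ when $b = a_i$, and $-1$ otherwise), and the function name and toy domains are hypothetical.

```python
from itertools import product

def zeta_raw(v, domains, i, b):
    """Assumed raw state influence: total change in v when feature i is forced to state b."""
    return sum(v(a[:i] + (b,) + a[i + 1:]) - v(a) for a in product(*domains))

# Unanimity-style game: v is 1 only on the single vector `target`.
domains = [(0, 1), (0, 1, 2)]
target = (1, 2)
u = lambda a: int(a == target)

same_state = zeta_raw(u, domains, 1, 2)    # b equals target[1]
other_state = zeta_raw(u, domains, 1, 0)   # b differs from target[1]
```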

Theorem D.1. If a value $\phi$ satisfies (D), (Sym), and (L), then $\phi = c\zeta$, where $c$ is
an arbitrary constant.

Proof. Let us observe that every game $v : A \to \{0, 1\}$ can be written as the disjoint
sum of unanimity games; namely $v = \sum_{a \in A:\ v(a) = 1} u_a$. Thus, it suffices to show that
the claim holds for unanimity games.

Let $\mathcal{U}_a = (N, A, u_a)$; we show that $\phi_{i,b}(\mathcal{U}_a)$ equals $c\zeta_{i,b}(\mathcal{U}_a)$. First, if $b = a_i$ then
$\zeta_{i,b}(\mathcal{U}_a) = |A_i| - 1$; if $b \ne a_i$ then $\zeta_{i,b}(\mathcal{U}_a) = -1$. Now, by symmetry, we have that
$\phi_{i,b}(\mathcal{U}_a) = \phi_{i,b'}(\mathcal{U}_a)$ for all $b, b' \ne a_i$. If we write $\phi_{i,b}(\mathcal{U}_a) = y$ for all $b \ne a_i$, and
$\phi_{i,a_i}(\mathcal{U}_a) = x$, then according to Proposition 2.5, $\sum_{b \ne a_i} y + x = 0$, which implies
that $x = -y(|A_i| - 1)$. Finally, according to feature symmetry, the value of $y$ cannot
depend on $i$, and is equal for all $j \in N$. We conclude that for all $i \in N$ and all $b \in A_i$,
$\phi_{i,b}(\mathcal{G}) = c\zeta_{i,b}(\mathcal{G})$. □
As a direct corollary of Theorem A.5, we have that the unique (up to a constant)
state value to satisfy (Sym), (D) and (DU) axioms (see Definitions 2.1, 2.2 and 2.6 in 
Section 2) is 

Xi,b{s) = 

slGA 


E Influence in Weighted Settings 

Unlike previous sections, let us assume that there is some weight function $w : A \to \mathbb{R}_+$
that assigns a non-negative weight to every state vector. The function $w$ can be thought of as a
prior distribution that governs the likelihood of observing a state vector $a \in A$. Given
B C A, let w{B) denote w{a). We also write for a given b € Ai, wib \ a_i) = 

Xa_ieA_i w{a_i,b)-, for a given a_i e A_i, we write w(a„i) = w{a^i,b). 

Given this definition, let us rethink the disjoint union property. Given a set of winning
state vectors $W \subseteq A$ and a set of losing state vectors $L \subseteq A$, we can think of a
weighted influence measure as a function $\phi_i$ of $W$, $L$ and $w : A \to \mathbb{R}_+$.

Fix some $C \subseteq A$. Given two functions $w, w' : A \to \mathbb{R}_+$ that agree on $C$ (i.e.
$w(a) = w'(a)$ for all $a \in C$), and some $B \subseteq A \setminus C$, let us write

\[
(w \oplus_B w')(a) =
\begin{cases}
w(a) & \text{if } a \in C \\
w(a) + w'(a) & \text{if } a \in B.
\end{cases}
\]

Definition E.1. We say that an influence measure satisfies weighted disjoint union
(WDU) if for any disjoint $B, C \subseteq A$ and any two weight functions $w, w' : A \to \mathbb{R}_+$
that agree on $C$, we have that $\phi_i(B, C, w) + \phi_i(B, C, w') = \phi_i(B, C, w \oplus_B w')$.

Lemma E.2. Weighted disjoint union implies the disjoint union property.

We again write $W_{a_{-i}} = \{(a_{-i}, b) \in A \mid v(a_{-i}, b) = 1\}$, and $L_{a_{-i}} = \{(a_{-i}, b) \in A \mid v(a_{-i}, b) = 0\}$.

Given a weight function $w : A \to \mathbb{R}_+$ and a game $\mathcal{G} = (N, A, v)$, let

\[
\chi_i^w(\mathcal{G}, w) = \sum_{a \in A} w(a) \sum_{b \in A_i} w(b \mid a_{-i})\, |v(a_{-i}, b) - v(a)|.
\]
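
A direct way to evaluate the weighted measure on a small feature space is sketched below. It assumes that $w(b \mid a_{-i})$ is the conditional weight $w(a_{-i}, b) / \sum_{c \in A_i} w(a_{-i}, c)$, which is one natural reading of the notation rather than something stated explicitly here; the function name, the toy classifier and the uniform weights are hypothetical.

```python
from itertools import product

def chi_weighted(v, domains, i, w):
    """Weighted influence: sum over a of w(a) * sum over b of w(b | a_{-i}) * |v(a_{-i}, b) - v(a)|."""
    total = 0.0
    for a in product(*domains):
        neighbours = [a[:i] + (c,) + a[i + 1:] for c in domains[i]]
        marginal = sum(w(p) for p in neighbours)     # weight of a_{-i}
        if marginal == 0:
            continue                                 # conditional weight undefined, skip
        for p in neighbours:
            total += w(a) * (w(p) / marginal) * abs(v(p) - v(a))
    return total

domains = [(0, 1), (0, 1, 2), (0, 1)]
v = lambda a: int(a[0] + a[1] >= 2)
uniform = lambda a: 1.0 / 12                         # |A| = 2 * 3 * 2 = 12
weighted_value = chi_weighted(v, domains, 1, uniform)
```

Under uniform weights the result is the unweighted pair count (16 for this toy classifier) divided by $|A| \cdot |A_i| = 36$, so the weighted and unweighted measures agree up to a constant in that case.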

Let us extend the symmetry axiom (Definition 2.1) to a weighted variant. Given a
weight function $w : A \to \mathbb{R}_+$ and a bijection $\sigma$ over $A_i$ or $N$, we let $\sigma w(a) = w(\sigma a)$.

Definition E.3. Given a game $\mathcal{G} = (N, A, v)$ and a weight function $w : A \to \mathbb{R}_+$, we
say that an influence measure $\phi$ is state-symmetric with respect to $w$ (Sym-$w$) if for any
permutation $\sigma : A_i \to A_i$, and all $j \in N$, $\phi_j(\sigma\mathcal{G}, \sigma w) = \phi_j(\mathcal{G}, w)$. That is, relabeling
the states and letting them keep their original distributions does not change the value
of any feature. Similarly, we say that an influence measure $\phi$ is feature-symmetric if
for any permutation $\sigma : N \to N$, $\phi_{\sigma(i)}(\sigma\mathcal{G}, \sigma w) = \phi_i(\mathcal{G}, w)$. That is, relabeling the
coordinate of a feature does not change its value.
Theorem E.4. If a probabilistic influence measure $\phi$ satisfies (D), (Sym) and (DU)
with respect to some weight function $w$, then

\[
\phi_i(\mathcal{G}, w) = c\,\chi_i^w(\mathcal{G}, w).
\]

Before we proceed, we wish to emphasize two important aspects of Theorem E.4.
First, if we set $w(a) = \frac{1}{|A|}$ for all $a \in A$, then we obtain Theorem 2.8. In other words, $\chi$ is an
influence measure that assumes that all elements in the dataset are equally likely.

Another point of note is the underlying process that the influence measures entail. 
If we assume that the weight function describes a distribution over A, one can think of 
the influence measure as the following process. We begin by picking a point from A 
at random (uniformly at random in the case of x, and according to w in Theorem E.4); 
next, fixing the states of all other features, we measure the probability that i can change 
the outcome, by sampling a different state according to the distribution w(- \ a_i). 
Before we prove Theorem E.4, let us prove the following lemma. 

Lemma E.5. 

\[
\chi_i^w(\mathcal{G}, w) = 2 \sum_{a_{-i} \in A_{-i}} w(a_{-i})\, w(W_{a_{-i}})\, w(L_{a_{-i}})
\]


Proof 

X^{Q) = XI I a-^)\v{a_i,b) - v(a)| 

aG.4 bGAi 

= 2 Y, X X w{c I a-i)w{b I a_i) 

cGAi: bGAi: 

v{a^i,c)—0 v{a-i,b) — l 

= 2 Y X w{c\ a_i)w{Wa_i) 

cGAi: 

v(a—i,c)—0 

= 2 Y w{a-i)w{Lg__.)w(Wa._J 

a_iGA 


□ 


Lemma E.6. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be a function that satisfies

(i) $f(x, 0) = f(0, y) = 0$.

(ii) $f(x_1, y) + f(x_2, y) = f(x_1 + x_2, y)$.

(iii) $f(x, y_1) + f(x, y_2) = f(x, y_1 + y_2)$.

Then there is some constant $c$ such that $f(x, y) = cxy$.

Proof. First, we show that $f(rx, y) = r f(x, y)$ for all $r \in \mathbb{R}$. Given any $n \in \mathbb{N}$,
$f(nx, y) = n f(x, y)$ by property (ii). Similarly, $f(\frac{x}{n}, y) = \frac{1}{n} f(x, y)$. Thus, for
any rational number $q \in \mathbb{Q}$, we have $f(qx, y) = f(x, qy) = q f(x, y)$. Now, take
any real number $r \in \mathbb{R}$. There exists a sequence of rational numbers $(q_n)_{n=1}^{\infty}$ such
that $\lim_{n \to \infty} q_n = r$. Thus, $f(rx, y) = \lim_{n \to \infty} f(q_n x, y) = \lim_{n \to \infty} q_n f(x, y) = r f(x, y)$
(and similarly $f(x, ry) = r f(x, y)$).

Let us observe the partial derivatives of $f$ at $x^* \ne 0$:

\[
\begin{aligned}
\frac{\partial f}{\partial x}(x^*, y^*) &= \lim_{\varepsilon \to 0} \frac{f(x^* + \varepsilon, y^*) - f(x^*, y^*)}{\varepsilon} \\
&= \lim_{\varepsilon \to 0} \frac{\left(\frac{x^* + \varepsilon}{x^*} - 1\right) f(x^*, y^*)}{\varepsilon} = \frac{f(x^*, y^*)}{x^*},
\end{aligned}
\]

and similarly $\frac{\partial f}{\partial y}(x^*, y^*) = \frac{f(x^*, y^*)}{y^*}$. We obtain the following differential equation:
$x \frac{\partial f}{\partial x} - f = 0$. Its only solution is $f(x, y) = g(y)x + h(y)$. However, since $f(0, y) = 0$
for all $y$, we get that $h(y) = 0$. Similarly, $f(x, y) = k(x)y$. Putting it all together, we
get that $f(x, y) = cxy$.

□

□ 


Lemma E.7. If a value $\phi$ satisfies the (WDU) and (Sym-$w$) property, then it agrees
with $\chi^w$ on any game $\mathcal{G} = (\{i\}, A_i, v)$ with any weight function $w : A_i \to \mathbb{R}_+$.

Proof Let us write Wi and Li to be the winning and losing states in Ai. By state 
symmetry we know that <j) is only a function of and . By the 

weighted disjoint union property, we know that 

<i)iiw{b),w{c)). 

bGWi cGLi 

Using the (WDU) property, we know that the following holds for single-feature games
with only two states. Given $x_1, x_2, y \in \mathbb{R}_+$, the following holds:

\[
\begin{aligned}
f_i(x_1 + x_2, y) &= f_i(x_1, y) + f_i(x_2, y) \\
f_i(y, x_1 + x_2) &= f_i(y, x_1) + f_i(y, x_2)
\end{aligned}
\]

By Lemma E.6, we know that $f_i(x, y) = cxy = c\chi_i^w(x, y)$. In particular, this implies
that $\phi_i(\mathcal{G}, w) = c\chi_i^w(\mathcal{G}, w)$, and we are done. □

Proof of Theorem E.4. First, we note that $\chi^w$ satisfies (D), (Sym-$w$) and (WDU) (this
is an easy exercise). We write $W$ to be the winning state vectors in $A$ and $L$ to be
the losing state vectors in $A$. Now, if either $w(W) = 0$ or $w(L) = 0$, any influence
measure that satisfies (D) assigns a value of zero to all $i \in N$, and the claim trivially
holds. Thus, we assume that $w(W), w(L) > 0$.

Next, according to the (DU) property, we can write

\[
\phi_i(W, L, w) = \sum_{a_{-i} \in A_{-i}} \phi_i(W_{a_{-i}}, L_{a_{-i}}, w).
\]

The argument is the same as the one used for the decomposition of $\chi$ in Theorem 2.8.
By the above lemmas, $\phi_i(W_{a_{-i}}, L_{a_{-i}}, w) = C\chi_i^w(W_{a_{-i}}, L_{a_{-i}}, w)$. Note that by feature
symmetry, it must be the case that the constant $C$ is independent of $i$. □
F Generalized Distance Measures


Suppose that we have a set of feature vectors $B \subseteq A$. In previous sections we had
assumed that there was some function $v : A \to \{0, 1\}$ that classified a vector as
either having a value of 0 or a value of 1. We then proceeded to provide an axiomatic
characterization of influence measures in such settings. Influence was largely based on
the following notion: a feature $i \in N$ can influence the vector $a \in B$ if $|v(a_{-i}, b) - v(a)| = 1$.
Let us now consider a more general setting; instead of defining a classifier
over data points, we have some pseudo-distance measure over the vectors. Recall that a
pseudo-distance measure is a function $d : A \times A \to \mathbb{R}$ that satisfies all of the distance
axioms, except that $d(a, b) = 0$ does not necessarily imply that $a = b$. Given some pseudo-distance
measure $d$ over $A$, rather than measuring influence by the measure $|v(a_{-i}, b) - v(a)|$,
we measure influence by $d((a_{-i}, b), a)$.

We observe that if $d(a, b) \in \{0, 1\}$ for all $a, b \in A$, then we revert to the original
setting.

Given a pseudo-distance measure $d$ over $A$ and a dataset $B \subseteq A$, let us define
$V_d(B)$ to be the partition of $B$ into the equivalence classes defined by $a \sim b$ iff
$d(a, b) = 0$. In other words, $V_d(B)$ is the clustering of $B$ into groups of points that are at
distance zero from each other. Fixing a pseudo-distance $d$, we provide the following extensions
of the axioms defined in Section 2.

We keep the notion of symmetry used in Section 2 (Definition 2.1): an influence 
measure satisfies symmetry if it is invariant under coordinate permutations, both for 
individual features (e.g. renaming males to females and vice versa should not change 
the influence of any feature), and between the features (e.g. renaming gender and age 
should not change feature influence). We do, however, adopt more general definitions 
of the dummy and disjoint union properties. 

Definition F.1 (d-Dummy). We say that an influence measure satisfies the d-Dummy 
property if $\phi_i(B) = 0$ whenever $d((a_{-i}, b), a) = 0$ for all $a \in B$ and all $b \in A_i$ such 
that $(a_{-i}, b) \in B$. 
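
The premise of the d-Dummy property is easy to test on a finite dataset. The sketch below checks only that premise, namely whether changing feature $i$ ever moves a point a positive distance within $B$; it does not define an influence measure. The helper names and the example pseudo-distance are assumptions made for illustration.

```python
from itertools import product

def swap(a, i, b):
    # The vector (a_{-i}, b): a with its i-th coordinate replaced by b.
    return a[:i] + (b,) + a[i + 1:]

def is_d_dummy(B, i, A_i, d):
    """True if d((a_{-i}, b), a) = 0 for all a in B and b in A_i with (a_{-i}, b) in B."""
    members = set(B)
    return all(
        d(swap(a, i, b), a) == 0
        for a in B for b in A_i
        if swap(a, i, b) in members
    )

def d(a, b):
    # Hypothetical pseudo-distance that ignores the second feature entirely.
    return abs(a[0] - b[0])

B = list(product([0, 1], repeat=2))
dummy_flags = [is_d_dummy(B, i, [0, 1], d) for i in range(2)]
```

With this choice of `d`, the second feature is a d-dummy and the first is not, so any measure satisfying the property must assign the second feature zero influence.
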

Definition F.2 (Feature Independence). Let $B \subseteq A$ be a dataset, and let $B(a_{-i}) = 
\{b \in B \mid b_{-i} = a_{-i}\}$. An influence measure satisfies feature independence (FI) if 

$$\phi_i(B) = \sum_{a_{-i} \in A_{-i}} \phi_i(B(a_{-i})).$$
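
The sets $B(a_{-i})$ can be computed by grouping points on all coordinates other than $i$. The following is a minimal sketch with a hypothetical helper name `slices` and a made-up dataset; within each group, the points differ only in feature $i$.

```python
from collections import defaultdict

def slices(B, i):
    """Map each a_{-i} to B(a_{-i}) = {b in B : b_{-i} = a_{-i}}."""
    groups = defaultdict(list)
    for b in B:
        # The key is b with its i-th coordinate removed, i.e. b_{-i}.
        groups[b[:i] + b[i + 1:]].append(b)
    return dict(groups)

B = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1)]
by_rest = slices(B, 0)

# The slices are pairwise disjoint and together cover B.
covered = sorted(b for group in by_rest.values() for b in group)
```
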

Definition F.3 (d-Disjoint Union). Let $B \subseteq A$ be a dataset, and let $\{B_1, \dots, B_m\}$ 
be the equivalence classes of $B$ according to the pseudo-distance $d$. An influence 
measure $\phi$ satisfies the d-disjoint union property if, for any $j \in \{1, \dots, m\}$, any partition of $B_j$ 
into $C$ and $C'$ satisfies 

$$\phi_i(B_1, \dots, B_m) = \phi_i(B_{-j}, C) + \phi_i(B_{-j}, C') - \phi_i(B_{-j}).$$

Finally, the following axiom requires that in very minimal settings, a feature’s 
influence should agree with $d$. 

Definition F.4 (Agreement with Distance). 
Given a dataset $B \subseteq A$, define 

$$\chi_i^d(B) = \sum_{a \in B} \; \sum_{b \in A_i : (a_{-i}, b) \in B} d((a_{-i}, b), a) \qquad (20)$$
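
Equation (20) translates directly into a double loop over points and alternative feature values. The sketch below is an assumption-laden illustration: the function name `chi`, the pseudo-distance, and the dataset are invented for the example, and feature domains are passed explicitly.

```python
def swap(a, i, b):
    # The vector (a_{-i}, b): a with its i-th coordinate replaced by b.
    return a[:i] + (b,) + a[i + 1:]

def chi(B, i, A_i, d):
    """Total distance moved when feature i is changed to any value that stays inside B."""
    members = set(B)
    return sum(
        d(swap(a, i, b), a)
        for a in B for b in A_i
        if swap(a, i, b) in members
    )

def d(a, b):
    # Hypothetical pseudo-distance: absolute difference of the coordinate sums.
    return abs(sum(a) - sum(b))

B = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1)]
chi_0 = chi(B, 0, [0, 1, 2], d)  # first feature takes values in {0, 1, 2}
chi_1 = chi(B, 1, [0, 1], d)     # second feature takes values in {0, 1}
```

On this dataset the first feature receives the value 10 and the second the value 4, reflecting that changes to the first feature move points further under this particular `d`.
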

Lemma F.5. Let $B$ be a dataset of single-feature points. If $\phi$ satisfies d-(D), 
d-(DU), (Sym), and (AD), then $\phi(B) = \chi^d(B)$. 

Proof Sketch. We partition $B$ into its equivalence classes according to $d$, $B = \{B_1, \dots, B_m\}$. 
By an argument similar to that of Lemma 2.7, we can show that the symmetry axiom implies 
that $\phi$ is a function of $|B_1|, \dots, |B_m|$. Let $w_j = |B_j|$; employing the d-disjoint union 
property and the dummy property, we obtain that there exists some $m \times m$ matrix $D'$ 
such that $\phi(B) = w^{\top} D' w$, and $D'$ is 0 on the diagonal, non-negative, and symmetric 
(symmetry here is obtained via state symmetry). 

To show that $D'$ must coincide with the pseudo-distance, we apply the agreement 
with distance axiom to inputs to $\phi$ that have only two non-zero coordinates, and obtain 
the desired result. □ 
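
The quadratic form in this sketch can be sanity-checked numerically. The example below assumes that $D'$ is the matrix of pseudo-distances between class representatives, which is the conclusion of the lemma, and compares $w^{\top} D' w$ with the double sum of equation (20) on a single-feature dataset. The pseudo-distance and the dataset are hypothetical.

```python
def d(a, b):
    # Hypothetical pseudo-distance on single-feature points: compare buckets of width 2.
    return abs(a // 2 - b // 2)

def equivalence_classes(B, d):
    # Group points at mutual distance zero; one representative per class suffices.
    classes = []
    for a in B:
        for cls in classes:
            if d(a, cls[0]) == 0:
                cls.append(a)
                break
        else:
            classes.append([a])
    return classes

B = [0, 1, 2, 4, 5, 6]
classes = equivalence_classes(B, d)

# w_j = |B_j| and D[j][k] = distance between representatives of B_j and B_k.
w = [len(c) for c in classes]
D = [[d(cj[0], ck[0]) for ck in classes] for cj in classes]

# Quadratic form w^T D w, written out with plain loops.
m = len(w)
quadratic = sum(w[j] * D[j][k] * w[k] for j in range(m) for k in range(m))

# Equation (20) for a single feature: every b in B is a valid replacement for a.
direct = sum(d(b, a) for a in B for b in B)
```

Both quantities equal 44 here, and `D` is zero on the diagonal, non-negative, and symmetric, matching the properties of $D'$ listed in the proof sketch.
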

Theorem F.6. If an influence measure $\phi$ satisfies the d-dummy, d-disjoint union, 
symmetry and agreement with distance axioms, then 

$$\phi_i(B) = \alpha \sum_{a \in B} \; \sum_{b \in A_i : (a_{-i}, b) \in B} d((a_{-i}, b), a),$$

where $\alpha$ is a constant independent of $i$. 

Proof Sketch. The proof mostly follows the proof technique of Theorem 2.8. Let us 
write the influence of $i$ under $d$ as $\phi_i^d(B)$. 

Using the (FI) property, we decompose $\phi_i^d(B)$ into a sum over $|A_{-i}|$ different single-feature datasets. 
Next, we apply Lemma F.5 to each of these datasets to show that the identity holds. □ 